Reinforcement learning Research experiments OpenAI

RLD
REPORT TMEs + Project
Abdelraouf KESKES
January 2020

1 TME 1
The following figure shows the cumulative rewards and regrets for several approaches. Regret
is computed against the "optimal strategy", which always chooses the best arm at each
timestep. Its cumulative gain is the maximum achievable, so no strategy can ever exceed it.
Note also that for LinUCB we used α = 10; this hyperparameter is explored further below.
As expected, the random strategy is only a baseline, and it is far from the other approaches
in terms of both rewards and regrets. UCB is good but not that interesting in our case,
because there is a large gap between UCB (red line) and the best strategy (green line), which
always chooses the arm with the best average cumulative gain over time. The goal is to get as
close to this green line as possible. UCB-V (UCB with a better bound that includes the
variance) is the closest to the best strategy, and Linear UCB (context-based) is also very
close to UCB-V and the best strategy.
Conclusion: in our case UCB-V is the best one.
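As a rough illustration of how cumulative reward and regret can be computed for UCB, here is a minimal sketch on simulated Bernoulli arms. The arm probabilities, the function name run_ucb1 and the use of the expected gain of the best arm as the reference are assumptions made for illustration; the TME used its own data and baselines.

```python
import math
import random

def run_ucb1(probs, horizon, seed=0):
    """Play UCB1 on simulated Bernoulli arms; return (cumulative reward, regret)."""
    rng = random.Random(seed)
    n_arms = len(probs)
    counts = [0] * n_arms   # number of pulls per arm
    sums = [0.0] * n_arms   # total reward collected per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1     # pull every arm once to initialise the estimates
        else:
            # empirical mean plus exploration bonus
            scores = [sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
                      for a in range(n_arms)]
            arm = scores.index(max(scores))
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    # regret measured against always playing the arm with the highest expected gain
    regret = horizon * max(probs) - total
    return total, regret
```

Calling run_ucb1([0.2, 0.8], 2000) returns the pair plotted in curves of this kind, for one simulated run.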
I also ran some experiments with different α values for LinUCB; the following figure
illustrates them:
As we can see, performance is close across the different values. From the regrets plot we
see that the best α value lies between 10 and 100; since we stopped at 5000 iterations,
α = 100 was the best one at the last timestep.
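To make the role of α concrete, here is a minimal sketch of one arm of disjoint LinUCB, where α scales the confidence width added to the estimated reward. The class name LinUCBArm, the identity initialisation and the pure-Python matrix handling are assumptions for illustration, not the code used in the TME.

```python
import math

class LinUCBArm:
    """One arm of disjoint LinUCB with a d-dimensional context (illustrative sketch)."""

    def __init__(self, dim):
        # A^{-1} starts as the identity (ridge regularisation), b as zeros
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, x):
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.a_inv]

    def score(self, x, alpha):
        """Estimated reward plus alpha times the confidence width."""
        theta = self._a_inv_times(self.b)              # theta = A^{-1} b
        mean = sum(t * xi for t, xi in zip(theta, x))
        ax = self._a_inv_times(x)
        width = math.sqrt(sum(xi * v for xi, v in zip(x, ax)))
        return mean + alpha * width

    def update(self, x, reward):
        """Rank-one update of A^{-1} (Sherman-Morrison formula) and of b."""
        ax = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, ax))
        dim = len(x)
        for i in range(dim):
            for j in range(dim):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
        for i in range(dim):
            self.b[i] += reward * x[i]
```

At each step the arm with the highest score is played, so a larger α keeps rarely tried arms attractive for longer.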

2 TME 2
The following are some images taken from several trials and experiments that we ran during
the TME.
We experimented with both value iteration and policy iteration, with deterministic and
epsilon-greedy approaches. We also tried several initializations for the policy iteration
algorithm (cf. code: Uniform, Deterministic and Random). Over all grid world arenas (0 to 10)
we conclude that:
• Policy iteration is faster than value iteration, as a policy converges more quickly than a
value function.
• The discount factor determines how much importance (weight) we give to rewards over
time. For instance, if we are only interested in the next step we can set discount = 0,
and the higher the discount factor, the more importance we give to rewards further in
the future. Note that, in theory and in our experiments alike, setting discount = 1 may
never converge, especially if empty cells are penalized with a very small penalty such as
−0.0001 or smaller: the agent can wander around empty cells indefinitely!
• The agent's behaviour depended almost entirely on the empty-cell reward. Here is a
non-exhaustive list of arenas reporting the best reward value that we found in an
epsilon-greedy context using the value iteration algorithm:
0. Plan0 reward empty case = (−0.1)
9. Plan9 = env.getMDP() RecursionError: maximum recursion depth exceeded while getting
the repr of an object
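The effect of the discount factor and of the empty-cell penalty can be checked on a toy problem. The sketch below runs value iteration on a hypothetical two-cell corridor; the transitions dictionary layout is an assumption chosen for readability and is not the structure returned by env.getMDP().

```python
def value_iteration(transitions, gamma, tol=1e-8, max_iters=10_000):
    """transitions[s][a] is a list of (prob, next_state, reward, done) tuples."""
    values = {s: 0.0 for s in transitions}
    for _ in range(max_iters):
        delta = 0.0
        for s, actions in transitions.items():
            # Bellman optimality backup: best expected one-step return
            best = max(
                sum(p * (r + (0.0 if done else gamma * values[s2]))
                    for p, s2, r, done in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    return values

# Hypothetical corridor: empty cells cost -0.1, leaving cell 1 to the right gives +1.
corridor = {
    0: {"right": [(1.0, 1, -0.1, False)], "stay": [(1.0, 0, -0.1, False)]},
    1: {"right": [(1.0, 1, 1.0, True)], "left": [(1.0, 0, -0.1, False)]},
}
values = value_iteration(corridor, gamma=0.9)
```

With gamma = 0.9 the value of cell 0 is the empty-cell penalty plus the discounted goal reward; the max_iters cap is there because, as noted in the list above, the loop is not guaranteed to settle when the discount is 1.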
3 TME 3
Most of our experiments were done in plan7 of the grid world environment.
The following are the learning curves / average rewards over 1000 episodes using several
RL algorithms in their tabular version:
• Classical Q-learning (off-policy): the behavioural policy (e.g. ε-greedy) is different from
the update policy (greedy max).
• Sarsa (on-policy Q-learning): the behavioural policy is the same as the update
policy (for instance ε-greedy).
• Dyna-Q: a hybrid between model-based methods, where we try to estimate
the MDP through sampling, and Q-learning approaches, which are value-based and
focus on estimating a value function (for example Q[state, action]).
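The difference between the three tabular methods listed above comes down to the update target. The following is a minimal sketch, not the TME code: the function names are illustrative, and the Dyna-Q model here is simplified to a dictionary of the last observed outcome per (state, action) pair rather than an estimated MDP.

```python
import random
from collections import defaultdict

def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """Behaviour policy: random action with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])

def q_learning_update(q, s, a, r, s2, actions, lr=0.1, gamma=0.99, done=False):
    """Off-policy: bootstrap from the greedy (max) action in s2."""
    target = r if done else r + gamma * max(q[(s2, a2)] for a2 in actions)
    q[(s, a)] += lr * (target - q[(s, a)])

def sarsa_update(q, s, a, r, s2, a2, lr=0.1, gamma=0.99, done=False):
    """On-policy: bootstrap from the action a2 actually chosen in s2."""
    target = r if done else r + gamma * q[(s2, a2)]
    q[(s, a)] += lr * (target - q[(s, a)])

def dyna_q_planning(q, model, actions, n_samples=10, lr=0.1, gamma=0.99, rng=random):
    """Replay n_samples remembered transitions from the (simplified) learned model."""
    for _ in range(n_samples):
        (s, a), (r, s2, done) = rng.choice(list(model.items()))
        q_learning_update(q, s, a, r, s2, actions, lr, gamma, done)
```

Here q is expected to be a defaultdict(float) keyed by (state, action). The extra planning loop is what makes each Dyna-Q episode more expensive than a plain Q-learning or Sarsa episode.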

A smoother version would be:
These experiments were run in Plan7 with the following hyperparameters:
reward empty case = −0.1
discount factor = 0.99
learning rate = 0.1
ε-greedy = 0.1
learning rate (Dyna-Q model) = 0.1
n samples (Dyna-Q) = 10
In this setting the three algorithms converge to almost the same number of actions (the best
solution), around 30. Q-learning and Sarsa show approximately the same curve and the same
behaviour, with a slight advantage to Sarsa at the start of the learning process and a slight
advantage to Q-learning at the end of training, where it converges to the optimal policy and
gives better average rewards. Dyna-Q, however, reduces the number of steps very quickly (a
kind of boosted learning) and accordingly increases the average rewards very quickly, but
after 200 episodes it is overtaken by the purely value-based methods and keeps improving more
slowly than Q-learning and Sarsa, whose average rewards stabilize after 400 episodes. We add,
intuitively, that Dyna-Q requires more training time because of the MDP estimation.

4 TME 4
Deep Q-learning leverages advances in deep learning to learn policies in RL, especially when
the number of states becomes huge or infinite (the continuous case). Since neural networks
are universal approximators (universal approximation theorem), we use them to approximate
Q(state, action). However, contrary to supervised learning, training in RL raises two main
problems:
• The target yj is not stable through time, so we introduce the Target Network principle:
a second neural network onto which we copy the online network weights every C steps
(C is a hyperparameter). This ensures a stable target for at least C steps.
• Consecutive transitions (s1, a1, s2), (s2, a2, s3), ... depend on each other, which breaks
the i.i.d. hypothesis. To break this dependency we introduce a memory called Replay
Memory: we fill it up to its capacity and sample random batches from it while training,
following the supervised learning paradigm. This makes the chance of sampling a
time-dependent batch very low, and even if it happens it will not hurt learning.
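The two mechanisms described above can be sketched without any deep learning library. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, and network weights are represented by a plain dictionary standing in for real parameters.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer; once full, the oldest transitions are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # uniform sampling without replacement breaks the temporal ordering
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def maybe_sync_target(step, sync_every, online_weights, target_weights):
    """Copy the online weights into the target network every sync_every steps."""
    if step % sync_every == 0:
        target_weights.clear()
        target_weights.update(online_weights)
        return True
    return False
```

In a training loop, each environment step would push one transition, sample a batch once the memory holds at least batch_size items, and call maybe_sync_target with sync_every playing the role of C.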
4.1 CartPole
After implementing DQN, fine-tuning it for CartPole, and training it, we got the following result:
hyper-parameters:
n episodes = 2000
hiddensize = [128]
learning rate = 0.001
ε-greedy = 1.0 → 0.05
n target steps = 100
Loss = MSE
batchsize = 64
memory capacity = 1000
As we can see, the model is globally unstable, both in the loss and in the number of actions.
Our ultimate goal is to train the agent to reach the maximum number of actions, which is 500.
During the first 300 episodes learning was very slow, increasing only slightly. After
episode 300, however, the number of actions jumps sharply, up to 500.
After episode 500 there are roughly three intervals in which the number of actions was
maximal (500) in 95% of cases: [500, 1000], [1100, 1600] and [1800, 2000].
The behaviour of the loss during training is unusual and very unstable, with a lot of
oscillations.
The following is a plot of our agent during the game:
4.2 LunarLander
After fine-tuning our DQN for LunarLander and training it, we got the following result:
As we can see, the reward score increases over the episodes, which means that our agent
learns after several training crashes!
The following is one example of our agent's performance:
hyper-parameters:
n episodes = 500
hiddensize = [128]
ε-greedy = 1.0 → 0.01
n target steps = 20
Loss = MSE
batchsize = 64
4.3 Grid World
Since experiments take a long time, we focused on Plan1 to make sure that the approach works
and that the agent learns the best policy; switching to other plans is then only a matter of
hyperparameter tuning. This time the task was not that straightforward, and it required
some tricks to make it work.
The following figure shows the reward scores / number of actions over the episodes.
Note that our goal is to maximize the reward, which in our case would be
(-0.1)+1+(-0.1)+1=1.8
knowing that empty cells are rewarded -0.1, yellow and green cells +1, and the red cell -1.
As we can see, at the beginning our agent performed a lot of actions, which drove the
reward score down to almost -30, with 300 actions. After around 80 episodes, however, the
agent started converging to the optimal policy, reaching a reward of 1.8 with only 4 actions,
though with some oscillations.
The following is the path the agent learned in this grid world plan (five frames, numbered 1 to 5):
The hyperparameters used are:
n episodes = 500
hiddensize = [256, 30]
ε-greedy = 1.0 → 0.01
n target steps = 20
Loss = MSE
batchsize = 64
I also added learning rate decay: lr = lr/2 every 5 episodes.
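The halving schedule mentioned above can be written in closed form. This is a small sketch; the function name decayed_lr is hypothetical, and the initial rate passed to it is an assumption since the report does not state one for Grid World.

```python
def decayed_lr(initial_lr, episode, factor=0.5, every=5):
    """Learning rate in force at a given episode, multiplied by `factor` every `every` episodes."""
    return initial_lr * factor ** (episode // every)
```

Under this rule 500 episodes correspond to 100 halvings, so the rate becomes negligible well before the end of training; most of the learning therefore has to happen in the early episodes.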
5 TME 5
The goal of policy gradient methods is to model and optimize the policy directly. The policy
is usually modelled with a function (for instance a neural network) parameterized by θ,
written πθ(a|s). The value of the reward (our ultimate objective) depends on this policy.
Several algorithms have been proposed, and in this TME we use A2C, which has been shown to
utilize GPUs more efficiently and to work better with large batch sizes than its asynchronous
counterpart A3C. Actor-critic approaches are based on 2 concepts:
• The “Critic” : estimates the value function. This could be the action-value (the Q
value) or state-value (the V value).
• The “Actor” : updates the policy distribution in the direction suggested by the Critic
(such as with policy gradients).
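The interplay between the two roles above can be sketched with plain Python lists. This is an illustrative outline only: the function names are hypothetical, the critic is assumed to estimate the state value V, and in a real implementation the advantage would be treated as a constant when differentiating the actor loss.

```python
def discounted_returns(rewards, gamma=0.99, bootstrap=0.0):
    """R_t = r_t + gamma * R_{t+1}, seeded with a bootstrap value for the last state."""
    returns = []
    running = bootstrap
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def a2c_losses(log_probs, values, returns):
    """Actor and critic losses averaged over one batch of steps."""
    advantages = [ret - v for ret, v in zip(returns, values)]
    # actor: raise the log-probability of actions with a positive advantage
    actor_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / len(advantages)
    # critic: regress the value estimate towards the observed return
    critic_loss = sum(adv ** 2 for adv in advantages) / len(advantages)
    return actor_loss, critic_loss
```

The critic's estimate acts as a baseline: the actor is only pushed towards actions that did better than the critic expected.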
The following are our results after several experiments with A2C on the CartPole game:
Overall, when A2C is well trained, after several episodes the algorithm starts converging
to the best solution with more stability (in CartPole, reaching 500 actions), which was not
the case for DQN regarding stability and convergence.
The hyperparameters are:
n episodes = 5000
hiddensize = 128
batchsize = 128
6 TME 9 GANs
Since I was enrolled in the RDFIA course with Pr Matthieu Cord and spent several days
implementing and experimenting with GANs and conditional GANs and their sensitive
hyperparameters, I decided not to redo the work and to report directly from my previous
results, rather than spend time on something I have already learned and understood very well.

DCGANS :
Figure 1: GANs generation results through learning process
Figure 2: GANs Losses through learning process
• Generations get smoother and more realistic through the iterations, but after half of the
iterations the results are almost the same and no longer really improve.
• As expected, the generator loss decreases and the discriminator loss increases, which
means that our generator successfully produces images that our discriminator fails to
catch.
• There is no stability: the model keeps oscillating.
• Images are very diverse in terms of background (darker, lighter), skin color, hair style,
gender, ... but they are still not very realistic.
After a lot of experiments, we conclude that:
• GANs are extremely sensitive to the learning rate: a slight change of 0.0001 or 0.0002
can lead to very slow convergence or to divergence (instability). Additionally, we have
to decrease the learning rate while training (learning rate decay), because the learning
rate needed to generate smooth textures from randomness is not the same as the one
needed to render a correct face with all its coherent details.
• Increasing the momentum β1 to the default value 0.9 (roughly a moving average over
the 10 most recent gradients) resulted in training oscillation and instability, while
reducing it to 0.5 (a moving average over the 2 most recent gradients) helped stabilize
training.
• Batch sizes of 128 and 256 turned out to be a great trade-off. We tried 512 and 64, and
no usable results were generated even after a lot of iterations, so we stopped and went
back to 128/256.
• Training the model longer does not necessarily imply better practical performance,
most of the time.
• Balance nbStepsD and nbStepsG: every step taken down the hill changes the entire
landscape a little. It is a dynamic system where the optimization process is seeking not a
minimum but a "Nash equilibrium" between two forces. We set nbupdateD = 10 and
found that training went better and better, and we rapidly got plausible images.
• noise size = 100 is a good heuristic that works. We tried 10 and the results were bad;
our guess is that since researchers needed 100 for MNIST data, faces need at least
100. With 1000 we got a shape error with the 32 images and architecture. We extended it
to 512 (max) and noticed nothing specifically relevant.
• After moving to 64 × 64 images and extending our architectures, we realized that GANs
have strong potential to fit distributions smoothly on high-dimensional data. Our
results are the following:
Figure 3: GANs final result on 64 x 64 images
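The reading of β1 as a moving average over roughly 10 or 2 recent gradients, given in the list above, follows the usual rule of thumb that an exponential moving average with coefficient β covers about 1/(1 − β) steps. A one-line check, with a hypothetical helper name:

```python
def ema_window(beta):
    """Approximate number of recent gradients an exponential moving average covers."""
    return 1.0 / (1.0 - beta)
```

For β1 = 0.9 this gives about 10 and for β1 = 0.5 exactly 2, which matches the figures quoted for the two momentum settings.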

cDCGANs: the only difference is that we deal with the joint distribution P_{X,Y}(x, y)
instead of P_X(x).
Figure 4: cDCGANs generation results through learning process on MNIST
Figure 5: cDCGANs Losses through learning process on MNIST
Figure 6: cDCGANs final result after 20 epochs on MNIST

• The results are extremely realistic.
• The generator loss decreases and stabilizes perfectly.
• The discriminator is almost unable to distinguish between real and fake images; its
loss increases and then converges.
• Decreasing the learning rate helps a lot for smoother learning. However, I think that
after many epochs the generator was no longer able to move on and find a better local
minimum; it seems to be stuck in a local minimum because of the extremely small
learning rate. Hence the few messy examples (a dotted 0, an encapsulated 2, a dense
and rotated 7, and finally a circular 4).
7 TME 10(VAEs)
The following is the evolution of the results encoded in a 2D space using a VAE (three panels,
numbered 1 to 3).
Learning is very smooth according to the average loss function (avg/epoch) after finding good
hyperparameters:
n epochs = 10
latentdim = 2
batchsize = 128
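For reference, here is a minimal sketch of the two VAE-specific pieces that a training loss of this kind typically combines with the reconstruction term: the reparameterized latent sample and the closed-form KL penalty. The report does not show its loss code, so the function names and the diagonal Gaussian encoder output (mu, log_var) are assumptions.

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))
```

With latentdim = 2, mu and log_var would each hold two values per input image, and the KL term is zero exactly when the encoder outputs a standard normal.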

As we can see, the VAE suffers from blur: after fine-tuning our model, the results are
fairly realistic but sometimes blurry. After studying the effect of the hidden dimension in
our linear feed-forward network, we ended up with these findings:
2D denoising:
5D denoising:
20D denoising:

Conclusion: the larger the encoding space, the sharper the decoding and the better the
reconstruction.
In addition, we scattered the MNIST test data in 2D space in the case of the 2D encoding:
the clusters formed on unseen data (MNIST test dataset) are very plausible and realistic.

8 Project
Our RTS game looks like the following figure:
Our task is to gather as much gold as possible.
Our reward formula for each step is:
difference gold = current gold − previous gold
reward = α ∗ difference gold + β ∗ nearest cell gold
with:
• nearest cell gold defined as the distance between our agent and the nearest cell containing
gold.
• α and β are hyperparameters that we experimented with, rescaled, etc.
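The reward formula above translates directly into code. In this sketch the function names, the (x, y) tuple representation of positions and the Manhattan distance are assumptions; the report does not say which distance metric was used.

```python
def nearest_gold_distance(agent, gold_cells):
    """Manhattan distance from the agent to the closest gold cell (metric is an assumption)."""
    return min(abs(agent[0] - x) + abs(agent[1] - y) for x, y in gold_cells)

def step_reward(current_gold, previous_gold, agent, gold_cells, alpha, beta):
    """reward = alpha * difference_gold + beta * nearest_cell_gold."""
    difference_gold = current_gold - previous_gold
    return alpha * difference_gold + beta * nearest_gold_distance(agent, gold_cells)
```

For example, step_reward(3, 2, (0, 0), [(2, 1), (5, 5)], alpha=1.0, beta=-0.1) combines a gain of one gold with a distance of 3 to the nearest gold cell. The values of alpha and beta here are hypothetical; note that because the second term is a distance, its sign decides whether moving towards gold is rewarded or penalized.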
1. We started with DQN; it was very slow to train, and apart from the code we have
nothing to report about it.
2. We then tried the A2C algorithm over several trials, and the agent was unable to harvest
even one gold, which was very unexpected!
We saved one of the results of our experiments:

We tried different values of α and β, and it did not work.
We changed the reward formula several times, and it did not work.
We believe that with more computation power, more experiments with other algorithms,
and hyperparameter tuning it could work.