RLD
REPORT TMEs + Project
Abdelraouf KESKES
January 2020
1 TME 1
The following figure depicts the cumulative rewards / regrets for several approaches. Note that the regrets are computed with respect to the optimal strategy, which always chooses the best arm at each timestamp; its cumulative gain is the maximum achievable, and no other strategy can exceed it. Note also that for LinUCB we used α = 10; we experiment with this hyper-parameter further below.
As expected, the random policy is just a baseline, far behind the other approaches in terms of both rewards and regrets. We also notice that UCB is decent but not that interesting in our case, because there is a large gap between UCB (red line) and the best strategy (green line), which picks the arm with the best average cumulative gain over time; our goal is to approach this green line as closely as possible. UCB-V (UCB with a tighter bound that includes the variance) is the closest to the best strategy, and Linear UCB (context-based) is also very close to UCB-V and the best strategy.
Conclusion: in our case UCB-V is the best one.
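For reference, here is a minimal sketch of the index rules being compared (the constants and bookkeeping in our actual code may differ; `means`, `variances` and `counts` are per-arm statistics and the names are illustrative):

```python
import numpy as np

def ucb_index(means, counts, t):
    # Classical UCB1: empirical mean plus an exploration bonus that shrinks
    # as an arm is pulled more often.
    return means + np.sqrt(2.0 * np.log(t) / counts)

def ucbv_index(means, variances, counts, t):
    # UCB-V: the bonus also uses the empirical variance, which tightens the
    # bound for low-variance arms (hence the better regret we observe).
    return (means
            + np.sqrt(2.0 * variances * np.log(t) / counts)
            + 3.0 * np.log(t) / counts)

# At each timestamp t we pull the arm maximizing the index, e.g.
# arm = int(np.argmax(ucbv_index(means, variances, counts, t)))
```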
I also ran some experiments with different α values for LinUCB; the following figure illustrates them:
As we can see, the performances of the different values are close to one another. From the regret plot we see that the best α value lies between 10 and 100; since we stop at 5000 iterations, α = 100 was the best one at the last timestamp.
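For completeness, here is a minimal LinUCB selection step showing where α enters (a sketch of the standard disjoint-model formulation; variable names are illustrative and may not match our code):

```python
import numpy as np

def linucb_choose(contexts, A, b, alpha):
    """contexts: one feature vector x_a per arm;
    A[a] = I + sum(x x^T), b[a] = sum(r x) are the per-arm statistics."""
    scores = []
    for a, x in enumerate(contexts):
        A_inv = np.linalg.inv(A[a])
        theta = A_inv @ b[a]                    # ridge-regression estimate for the arm
        bonus = alpha * np.sqrt(x @ A_inv @ x)  # alpha scales the exploration width
        scores.append(theta @ x + bonus)
    return int(np.argmax(scores))
```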
2 TME 2
The following are some images taken from the various trials and experiments we did during this TME.
After experimenting with both algorithms, Value Iteration and Policy Iteration, using deterministic and epsilon-greedy approaches, and trying several initializations for Policy Iteration (cf. code: Uniform, Deterministic and Random) on all grid-world arenas (0 to 10), we conclude that:
• Policy Iteration is faster than Value Iteration, as a policy converges more quickly than a value function (a minimal Value Iteration sketch is given after the list below).
• the discount factor determines how much importance (weight) we give to future rewards: if we only care about the next step we can set discount = 0, and the higher the discount factor, the more weight we give to later actions. Note that, both in theory and in our experiments, setting discount = 1 may never converge, especially if empty cells are penalized with a very small penalty such as −0.0001 or smaller: the agent can wander around empty cells forever!
• the agent's behaviour depended almost entirely on the empty-cell reward; here is a non-exhaustive list over all arenas reporting the best empty-cell reward value we found in an epsilon-greedy context with the Value Iteration algorithm:
0. Plan0 reward empty case = (−0.1)
1. Plan1 reward empty case = (−0.1)
2. Plan2 reward empty case = (−0.01)
3. Plan3 reward empty case = (−0.001)
4. Plan4 reward empty case = (−0.1)
5. Plan5 reward empty case = (−0.001)
6. Plan6 reward empty case = (−0.001)
7. Plan7 reward empty case = (−0.01)
8. Plan8 reward empty case = (−0.01)
9. Plan9: env.getMDP() raises RecursionError: maximum recursion depth exceeded while getting the repr of an object
10. Plan10 reward empty case = (−0.001)
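As mentioned above, here is a minimal Value Iteration sketch for such an MDP; it assumes P[s][a] is a list of (prob, next_state, reward, done) tuples (the layout we assume for env.getMDP(); names are illustrative):

```python
def value_iteration(P, gamma=0.99, eps=1e-6):
    """P[s][a]: list of (prob, next_state, reward, done) tuples (assumed layout)."""
    V = {s: 0.0 for s in P}

    def q_value(s, a):
        return sum(p * (r + gamma * (0.0 if done else V[s2]))
                   for p, s2, r, done in P[s][a])

    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(s, a) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break

    # Greedy policy extraction from the converged value function.
    policy = {s: max(P[s], key=lambda a, s=s: q_value(s, a)) for s in P}
    return V, policy
```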
3 TME 3
Most of our experiments were done in Plan7 of the grid-world environment.
The following are the learning curves / average rewards over 1000 episodes for several RL algorithms in their tabular version:
• Classical Q-learning (off-policy): the behavioral policy (e.g. ε-greedy) is different from the update policy (greedy max). A minimal sketch of the tabular updates is given after this list.
• Sarsa (on-policy Q-learning): the behavioral policy is the same as the update policy (for instance ε-greedy).
• Dyna-Q: a hybrid between model-based methods, where we estimate the MDP through sampling, and value-based Q-learning approaches, where we focus on estimating a value function (for example Q[state, action]).
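As referenced in the first bullet, here is a minimal sketch of the three tabular updates (a defaultdict Q-table; alpha, gamma and the names are illustrative, not necessarily those of our code):

```python
import random
from collections import defaultdict

Q = defaultdict(float)          # Q[(state, action)]
alpha, gamma = 0.1, 0.99

def q_learning_update(s, a, r, s2, actions):
    # Off-policy: bootstrap on the greedy (max) action in the next state.
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(s, a, r, s2, a2):
    # On-policy: bootstrap on the action actually taken by the eps-greedy policy.
    target = r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def dyna_q_planning(model, actions, n_samples=10):
    # Dyna-Q: replay n_samples transitions from the learned model
    # (model[(s, a)] = (r, s2)) and apply the Q-learning update to each.
    transitions = random.sample(list(model.items()), min(n_samples, len(model)))
    for (s, a), (r, s2) in transitions:
        q_learning_update(s, a, r, s2, actions)
```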
A smoother version would be:
According to these experiments in Plan7, using the following hyper-parameters:
reward empty case = −0.1
discount factor = 0.99
learning rate = 0.1
ε-greedy = 0.1
learning rate (Dyna-Q model) = 0.1
n samples (Dyna-Q) = 10
We can see that in this setting the three algorithms converge to almost the same number of actions (the best solution), around 30. Q-learning and Sarsa show approximately the same curve and behaviour, with a slight advantage to Sarsa at the start of training and a slight advantage to Q-learning at the end, converging to the optimal policy and giving better average rewards. Dyna-Q, however, reduces the number of steps very quickly (a kind of boosted learning) and obviously increases the average reward very quickly, but after 200 episodes it starts being overtaken by the purely value-based methods and keeps improving more slowly than Q-learning/Sarsa, whose average rewards stabilize after 400 episodes. We add that, intuitively, Dyna-Q requires more training time because of the MDP estimation.
4 TME 4
Deep Q-learning leverages advances in deep learning to learn policies in RL, especially when the number of states becomes huge or infinite (the continuous case). Since neural networks are universal approximators (universal approximation theorem), we use them to approximate Q(state, action). However, contrary to supervised learning, in RL we face two main problems during training:
• the target yj is not stable over time, so we introduce the Target Network principle: a second neural network onto which we copy the online network's weights every C steps (C is a hyper-parameter), which guarantees a stable target for at least C steps.
• the transitions (s1, a1, s2), (s2, a2, s3), ... are time-dependent, which breaks the i.i.d. hypothesis. To remove this dependency we introduce a Replay Memory: we fill it up to its capacity and sample random batches from it during training, following the supervised-learning paradigm. This makes it very unlikely to sample a time-correlated batch, and even if it happens it will not hurt learning. (A minimal sketch of both mechanisms is given after this list.)
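A minimal sketch of both mechanisms, assuming PyTorch modules for the online and target networks (names and capacities are illustrative):

```python
import random
from collections import deque

memory = deque(maxlen=1000)                   # Replay Memory with fixed capacity

def push(transition):
    # transition = (state, action, reward, next_state, done)
    memory.append(transition)

def sample(batch_size=64):
    # A uniformly random batch makes time-correlated samples very unlikely.
    return random.sample(memory, batch_size)

def maybe_sync_target(online_net, target_net, step, C=100):
    # Target network: hard copy of the online weights every C steps,
    # keeping the regression target stable in between.
    if step % C == 0:
        target_net.load_state_dict(online_net.state_dict())
```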
4.1 CartPole
After implementing DQN, fine-tuning it for CartPole, and training it, we obtained the following result:
Hyper-parameters:
n episodes = 2000
hidden size = [128]
discount factor = 0.999
learning rate = 0.001
ε-greedy = 1.0 → 0.05
n target steps = 100
Loss = MSE
batch size = 64
memory capacity = 1000
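The ε schedule (1.0 → 0.05) can be implemented, for instance, as an exponential decay; a small sketch with an illustrative decay constant:

```python
import math

def epsilon(step, eps_start=1.0, eps_end=0.05, decay=2000):
    # Decays from eps_start towards eps_end; `decay` controls the speed
    # (the value here is illustrative, not necessarily the one we used).
    return eps_end + (eps_start - eps_end) * math.exp(-step / decay)
```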
As we can see, globally the model is not stable, whether in the loss or in the number of actions; our ultimate goal is to train the agent to reach the maximum number of actions, which is 500. In the first 300 episodes learning is very slow and improves only slightly; after episode 300, however, the number of actions jumps drastically, reaching 500.
We also notice that after episode 500 there are approximately three chunky intervals where the number of actions is maximal (500) in about 95% of cases: [500, 1000], [1100, 1600], [1800, 2000].
The behaviour of the loss during training is unusual and very unstable, with a lot of oscillations.
The following is a plot of our agent during the game:
4.2 LunarLander
After fine-tuning our DQN for LunarLander and training it, we obtained the following result:
As we can see, the reward score increases over the episodes, which means that our agent learns after several training crashes!
The following is one example of our agent's performance:
Hyper-parameters:
n episodes = 500
hidden size = [128]
discount factor = 0.99
learning rate = 0.0005
ε-greedy = 1.0 → 0.01
n target steps = 20
Loss = MSE
batch size = 64
memory capacity = 1000
4.3 Grid World
Since experiments take a long time, we focused on Plan1 to ensure that the approach works and that the agent learns the best policy; switching to the other plans would then only be a matter of hyper-parameter tuning. This time the task was not as straightforward, so it required some tricks to make it work.
The following figure illustrates the reward score / number of actions over the episodes.
Note that our goal is to maximize the reward, which in our case is
(−0.1) + 1 + (−0.1) + 1 = 1.8,
knowing that empty cells are rewarded −0.1, the yellow and green cells +1, and the red cell −1.
As we can see, at the beginning our agent performs a lot of actions, which drives the reward score down to almost −30 with about 300 actions; however, after around 80 episodes the agent starts converging to the optimal policy, reaching a reward of 1.8 with only 4 actions (obviously with some oscillations).
The following is the path learned by the agent in this grid-world plan:
(Figure: the learned path, shown as five successive snapshots.)
The hyper-parameters used are:
n episodes = 500
hidden size = [256, 30]
discount factor = 0.99
learning rate = 0.0001
ε-greedy = 1.0 → 0.01
n target steps = 20
Loss = MSE
batch size = 64
memory capacity = 1000
I also added learning-rate decay: lr = lr/2 every 5 episodes.
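This decay can be done with a standard scheduler; a sketch assuming a PyTorch optimizer named `optimizer` and an episode loop (illustrative names, not necessarily those of our code):

```python
import torch

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

for episode in range(n_episodes):
    ...                 # run the episode and the DQN gradient steps
    scheduler.step()    # lr <- lr / 2 every 5 episodes
```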
5 TME 5
The goal of policy gradient methods is to model and optimize the policy directly. The policy is usually modeled with a function (for instance a neural network) parameterized by θ, written πθ(a|s). The value of the reward (our ultimate objective) depends on this policy. Several algorithms have been proposed; in this TME we use A2C, which has been shown to utilize GPUs more efficiently and to work better with large batch sizes. Actor-critic approaches are based on two concepts:
• The “Critic” : estimates the value function. This could be the action-value (the Q
value) or state-value (the V value).
• The “Actor”: updates the policy distribution in the direction suggested by the Critic (such as with policy gradients); a minimal sketch combining both concepts is given right after this list.
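As mentioned in the second bullet, here is a minimal sketch of the A2C loss combining both concepts (illustrative names; the entropy bonus is optional):

```python
import torch.nn.functional as F

def a2c_loss(log_probs, values, returns, value_coef=0.5,
             entropy=None, entropy_coef=0.01):
    """log_probs, values, returns: 1-D tensors over a batch of steps."""
    advantages = returns - values.detach()          # how much better than the critic expected
    actor_loss = -(log_probs * advantages).mean()   # policy gradient in the critic's direction
    critic_loss = F.mse_loss(values, returns)       # regress V(s) towards the observed returns
    loss = actor_loss + value_coef * critic_loss
    if entropy is not None:
        loss = loss - entropy_coef * entropy.mean() # optional exploration bonus
    return loss
```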
The following are our results after several experiments with A2C on the CartPole game:
Globally, we observe that when A2C is well trained, after several episodes the algorithm starts converging to the best solution with more stability, for example reaching 500 actions on CartPole, which was not the case for DQN in terms of stability and convergence.
The hyper-parameters are:
n episodes = 5000
hiddensize = 128
discount factor = 0.999
learning rate = 0.0005
batchsize = 128
6 TME 9 GANs
Since I was enrolled in the RDFIA course with Pr. Matthieu Cord and had already spent several days implementing and experimenting with GANs and conditional GANs and their sensitive hyper-parameters, I decided not to redo that work and to report directly from my previous experiments rather than spend time on something I had already learned and understood well.
DCGANs:
Figure 1: GANs generation results through learning process
Figure 2: GANs Losses through learning process
• Generations get smoother and more realistic over the iterations, but after about half of the iterations the results stay almost the same and do not really improve.
• As expected, the generator loss decreases and the discriminator loss increases, which means our generator successfully generates images that the discriminator fails to catch.
• there is no stability ... the model keeps oscillating.
• the images are quite diverse in terms of background (darker, lighter), skin color, hair style, gender, ... but they are still not very realistic.
After doing a lot of experiments, we conclude that:
• GANs are extremely sensitive to the learning rate: a slight change of 0.0001 or 0.0002 can lead to very slow convergence or to divergence (instability). Additionally, we have to decay the learning rate during training, because the learning rate needed to generate smooth textures from randomness is not the one needed to render a correct face with coherent details.
• Increasing the momentum β1 to the default value 0.9 (roughly a moving average over the 10 most recent gradients) resulted in training oscillation and instability, while reducing it to 0.5 (roughly a moving average over the 2 most recent gradients) helped stabilize training.
• batch sizes of 128 and 256 turn out to be a great trade-off; we tried 512 and 64 and the results were not generated at all after many iterations, so we stopped and went back to 128/256.
• training the model longer does not necessarily imply better practical performance, most of the time.
• balancing nbStepsD and nbStepsG: every step taken down the hill changes the entire landscape a little. It is a dynamic system in which the optimization process seeks not a minimum but a ”Nash equilibrium” between two forces. We set nbupdateD = 10 and found that training went better and better, rapidly producing plausible images (see the training-loop sketch below).
• noise size = 100 is a good heuristic that works; we tried 10 and the results were bad. We guess that since 100 was needed for MNIST data, faces need at least 100. With 1000 we got a shape error with our 32 × 32 images and architecture; we extended it to 512 (max) and nothing specifically relevant was noticed.
• after moving to 64 × 64 images and extending our architectures, we realized that GANs have a strong potential to smoothly fit distributions over high-dimensional data; our results are the following:
Figure 3: GANs final result on 64 x 64 images
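As referenced above, here is a minimal sketch of the alternating optimization we tuned (nbStepsD discriminator updates per generator update, Adam with β1 = 0.5; shapes and names are illustrative and assume a sigmoid-output discriminator):

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(G, D, optG, optD, real, noise_size=100, nb_steps_D=10):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Several discriminator steps per generator step (nbStepsD).
    for _ in range(nb_steps_D):
        z = torch.randn(batch, noise_size)
        fake = G(z).detach()                 # do not backprop into G here
        lossD = bce(D(real), ones) + bce(D(fake), zeros)
        optD.zero_grad(); lossD.backward(); optD.step()

    # One generator step: try to make D label the fakes as real.
    z = torch.randn(batch, noise_size)
    lossG = bce(D(G(z)), ones)
    optG.zero_grad(); lossG.backward(); optG.step()
    return lossD.item(), lossG.item()

# The optimizers are typically Adam(lr=2e-4, betas=(0.5, 0.999)),
# matching the beta1 = 0.5 observation above.
```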
cDCGANs: the only difference is that we now deal with the joint distribution PX,Y (x, y) instead of PX(x).
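In practice the conditioning can be implemented by feeding the label to both networks, for instance by concatenating a one-hot label with the generator's noise (and with the image, or its features, on the discriminator side); a minimal sketch of the generator input:

```python
import torch
import torch.nn.functional as F

def conditional_noise(batch_size, n_classes=10, noise_size=100):
    z = torch.randn(batch_size, noise_size)
    y = torch.randint(0, n_classes, (batch_size,))
    y_onehot = F.one_hot(y, n_classes).float()
    # The generator then models P(x | y) from the concatenated input.
    return torch.cat([z, y_onehot], dim=1), y
```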
Figure 4: cDCGANs generation results through learning process on MNIST
Figure 5: cDCGANs Losses through learning process on MNIST
Figure 6: cDCGANs final result after 20 epochs on MNIST
• the results are extremely realistic
• Generator loss decreases and stabilizes perfectly
• The discriminator is almost unable to distinguish between real and fake images: its loss increases and then converges.
• decreasing the learning rate helps a lot for smoother learning; however, I think that after many epochs the generator was unable to move on and find a better local minimum and got stuck, because of the extremely small learning rate. Hence the mess in a very few examples (a dotted 0, an encapsulated 2, a dense and rotated 7, and finally a circular 4).
7 TME 10 (VAEs)
The following is the evolution of the results encoded in a 2D latent space using a VAE (three panels). The learning is very smooth according to the average loss (avg/epoch) after finding good hyper-parameters:
n epochs = 10
latentdim = 2
learning rate = 0.01
batchsize = 128
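A minimal sketch of the objective we optimize (reparameterization trick plus reconstruction + KL terms; names are illustrative):

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)     # z = mu + sigma * eps

def vae_loss(x_recon, x, mu, logvar):
    # Reconstruction term (Bernoulli decoder on MNIST pixels in [0, 1]) ...
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # ... plus the KL divergence between q(z|x) and the standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```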
As we can see, the VAE suffers from blur; after fine-tuning our model the results are somewhat realistic but still sometimes blurry. After studying the effect of the latent (encoding) dimension in our linear feed-forward network, we end up with the following findings:
2D denoising :
5D denoising :
20D denoising :
Conclusion: the larger the encoding space, the sharper the decoding and the better the reconstruction.
In addition, we scattered the MNIST test data in the 2D space in the case of the 2D encoding: the clusters formed on unseen data (the MNIST test set) are very plausible and realistic.
8 Project
Our RTS game looks like the following figure.
Our task is to gather the maximum amount of gold.
Our reward formula for each step (sketched in code after the list below) is:
difference gold = current gold − previous gold
reward = α ∗ difference gold + β ∗ nearest cell gold
with:
• nearest cell gold defined as the distance between our agent and the nearest cell containing gold.
• α and β hyper-parameters that we experiment with, scale, etc.
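As mentioned above, a minimal sketch of this reward computation (the helper names are illustrative; `nearest_gold_distance` is assumed to be computed elsewhere from the map):

```python
def step_reward(current_gold, previous_gold, nearest_gold_distance, alpha, beta):
    difference_gold = current_gold - previous_gold
    # alpha and beta are the hyper-parameters we experimented with (sign and scale).
    return alpha * difference_gold + beta * nearest_gold_distance
```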
1. We started with DQN; it was very slow to train, and apart from the code we have nothing to report about it.
2. We then tried the A2C algorithm over several trials, and the agent was unable to harvest even a single gold, which was very unexpected!
We saved one of the results of our experiments:
We tried different values of α and β and it did not work.
We changed the reward formula several times and it did not work.
We believe that with more computational power, more experiments with other algorithms, and hyper-parameter tuning, it could work.