Ukrainian Catholic University
Faculty of Applied Sciences
Data Science Master Program
January 23rd
Abstract. In this project (Glusco and Maksymenko, 2019), we address the Reinforcement Learning problem of exploration vs. exploitation. The problem can be rephrased in terms of generalization and overfitting, or of efficient learning. To tackle it, we combine techniques from several lines of research: we introduce noise as an environment characteristic (Packer et al., 2018); create a setup of multiple Reinforcement Learning agents and environments that train in parallel and interact with each other (Jaderberg et al., 2017); and use a parallel tempering approach that initializes environments with different temperatures (noise levels) and performs exchanges according to the Metropolis-Hastings criterion (Pushkarov et al., 2019). We implemented a multi-agent architecture with parallel tempering based on two different Reinforcement Learning algorithms – Deep Q Network and Advantage Actor-Critic – and an OpenAI Gym (Gym: A toolkit for developing and comparing reinforcement learning algorithms) environment wrapper for noise injection. We used the CartPole environment to run multiple experiments with three different types of exchanges: no exchange, random exchange, and smart exchange according to the Metropolis-Hastings rule. We implemented aggregation functionality to gather the results of all experiments and visualize them with charts for analysis. The experiments showed that a parallel tempering approach with multiple environments at different noise levels can improve the performance of the agent under specific circumstances. At the same time, the results raised new questions that should be addressed to fully understand the behavior of the implemented approach.
4. Introduction to Reinforcement Learning:
● Markov Decision Process
  https://en.wikipedia.org/wiki/Markov_decision_process
● Return
● State-value function
● Action-value function
● Bellman Optimality Equation
6. Related work
● “Assessing Generalization in Deep Reinforcement Learning”
● “Population Based Training of Neural Networks”
● “Training Deep Neural Networks by optimizing over nonlocal paths in hyperparameter space”
12. Evaluation:
A2C, Smart Exchange
● Distance curves are combined
● Training scores with low noise are higher and more stable
● Play scores are a little more stable
13. Evaluation:
DQN, No Exchange
● Inverse distance picture
● High noise dominates the agent
● Decent play scores only for agents with low noise
17. Future work
● Environments with higher complexity
● Different noise types
● Compare explorations (noise vs agent)
● Exploration potential of the environment
● Parallelize training between exchanges
18. Review discussion
C1: No reason to have Bellman’s equation in this work (Chapter 1).
A1: Bellman’s equation is a fundamental equation.
C2: Equation on page 8 lacks any explanation.
A2: Agree. Provide more context or remove the formula.
C3: Page 9 hints at some convergence results, but lacks sufficient explanation and assumptions.
A3: It is a well-known property of the detailed balance condition and Metropolis-Hastings sampling that in the long-time limit we efficiently sample from the underlying stationary distribution.
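For context, the standard statements behind this answer (generic textbook definitions, not specific to this thesis; \pi is the target stationary distribution, P the transition kernel, q the proposal distribution):

\pi(x)\, P(x \to y) = \pi(y)\, P(y \to x) \quad \text{(detailed balance)}

A(x \to y) = \min\!\left( 1,\ \frac{\pi(y)\, q(y \to x)}{\pi(x)\, q(x \to y)} \right) \quad \text{(Metropolis-Hastings acceptance)}

A chain whose proposals are accepted with this probability satisfies detailed balance with respect to \pi, so in the long-time limit it samples from \pi regardless of the initial state.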
19. Review discussion
C4: Why was a multiplicative noise added, and not an additive noise, which seems like a more natural choice for Cart Pole?
A4: Multiplicative noise takes into account the scale of the value.
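For illustration, a minimal sketch of how such multiplicative observation noise could be added as an OpenAI Gym wrapper (the class name, the Gaussian form, and the noise_level parameter are assumptions for this example, not the exact thesis code):

import gym
import numpy as np

class MultiplicativeNoiseWrapper(gym.ObservationWrapper):
    """Scales each observation component by a random factor around 1,
    so the perturbation is proportional to the value's own scale."""
    def __init__(self, env, noise_level=0.1):
        super().__init__(env)
        self.noise_level = noise_level  # plays the role of the replica "temperature"

    def observation(self, obs):
        factor = 1.0 + self.noise_level * np.random.randn(*np.shape(obs))
        return obs * factor

# Usage: env = MultiplicativeNoiseWrapper(gym.make("CartPole-v1"), noise_level=0.05)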
C5: The choice of DQN and A2C seems an overkill and somewhat counterproductive. I would expect that policy methods would be a more natural choice, for instance, cross-entropy or natural evolution strategies.
A5: The focus of the work is on the diffusion picture rather than on the specific agent algorithm.
20. Review discussion
C6: Hypothesis 1: whether noise improves exploration. I am not convinced that the charts shown are enough evidence of the first part. I would have liked to see the trajectories of the cart in the space (x, x’), which corresponds to the first two coordinates.
A6: The distance is not bound to the environment and depends only on the model parameters.
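To make this concrete, a sketch of how such a distance between two agents can be computed purely from their parameters (assuming an L2 norm over flattened weight arrays; the exact metric used in the thesis may differ):

import numpy as np

def parameter_distance(params_a, params_b):
    """L2 distance between two agents' parameter sets (lists of weight arrays).
    It depends only on the models, not on the environment."""
    vec_a = np.concatenate([np.ravel(p) for p in params_a])
    vec_b = np.concatenate([np.ravel(p) for p in params_b])
    return float(np.linalg.norm(vec_a - vec_b))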
C7: Hypothesis 2: replica exchange improves training. Improving by getting a better score is meaningless. The real challenge is to justify the usefulness of a method by showing that it trains faster/cheaper than the competitors.
A7: We used the scores as mean values together with their variance, which shows the stability of the agent. Speed/computation cost is mentioned as future work.
21. Review discussion
C8: The Appendix feels indeed like covering some page quota, charts are hard to read and might even hint at some data leakage issues, as they suspiciously overlap with each other too much.
A8: Show more results for a more complete picture.
C9: There are minor grammar issues (misplaced articles), but nothing serious.
A9:
Hello, my name is Dmitrii Glushko, my supervisor is Dr. Mykola Maksymenko, and I want to present to you my master thesis, which is called Replica Exchange For ...
Here are the contents of my presentation: we will briefly go through an introduction to RL, describe the chosen problem, look at the related work, show the solution and evaluation, draw conclusions, and discuss future work and the review.
So let’s briefly describe the key concepts in the RL:
We have an environment and an agent. We can control only the agent, so that is what we train. The agent interacts with the environment by choosing an available action, observing the environment’s state, and receiving a reward. The main idea is to maximize the obtained reward.
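In OpenAI Gym terms, this interaction loop looks roughly as follows (a minimal sketch with CartPole and random actions, using the classic Gym API; the actual agents in this work are DQN and A2C):

import gym

env = gym.make("CartPole-v1")
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()            # the agent chooses an available action
    obs, reward, done, info = env.step(action)    # observe the new state, receive the reward
    total_reward += reward                        # the goal is to maximize this quantity
env.close()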
Usually an RL environment is described as an MDP: a set of states, actions, transitions between them, and rewards. For the total reward we use the term “Return”. A few more notions here: the policy is the behavior of the agent (what actions it chooses in a given state), the state-value function is the expected return for being in a given state, and the action-value function is the expected return for choosing a specific action in a given state. v* and q* are the optimal value/action-value functions (the ones that maximize the return).
And the foundation of all RL algorithms is the Bellman Optimality Equation for MDPs, which shows us that to obtain optimal values we need to choose the actions that lead us to the optimal value at the next state.
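For reference, these notions in their standard textbook form (generic definitions, reproduced here only for clarity):

G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \quad \text{(return, with discount factor } \gamma\text{)}

v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right], \qquad q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s,\ A_t = a \right]

v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\left[ r + \gamma\, v_*(s') \right] \quad \text{(Bellman Optimality Equation)}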
Statement about exploration: the agent’s behavior of choosing actions in order to find state-action combinations not seen before.
Statement about exploitation: choosing the action that currently looks optimal.
This problem leads to not finding the best agent, and the result depends strongly on the initial state of the agent -> poor stability of agents in RL.
To solve the problem we need to find an optimal balance: efficiently exploring promising parts of the space while still exploiting what has already been learned.
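As a simple illustration of this balance (the textbook epsilon-greedy rule, shown here only as an example of trading off the two behaviors, not as the mechanism studied in this work):

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a random action (exploration),
    otherwise take the action with the highest estimated value (exploitation)."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore: try something possibly unseen
    return int(np.argmax(q_values))               # exploit: use current knowledge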
We reviewed many related papers, but let’s talk about the ones we use to build our own solution.
Noise
PBT: hyperparameter optimization. PBT is a technique for training multiple agents at the same time to find the optimal setup. The technique is based on a genetic algorithm: copy the best agents, or explore by randomly changing parameters.
Hyperparameter optimization: noise flattens the loss function and helps avoid getting stuck in local minima. Hyperparameters are changed via the Metropolis-Hastings exchange rule, which guarantees that in the long-time limit there is no dependence on the initial state (the whole hyperparameter space is explored).
The noise is similar to the “Assessing Generalization” paper, but without using the Sunblaze environments.
The setup of multiple agents and environments training in parallel is similar to PBT, but with a different approach to communication.
We use a Metropolis-Hastings exchange rule similar to “Training Deep Neural Networks by optimizing over nonlocal paths in hyperparameter space”, with an adapted exchange rule.
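A minimal sketch of the standard parallel tempering (replica exchange) acceptance test that such a rule is based on; here “energy” stands for whatever quality measure the adapted rule uses (e.g. a negative score), and temperatures correspond to the environments’ noise levels – the exact adapted rule in the thesis may differ:

import numpy as np

def maybe_swap(energy_i, energy_j, temp_i, temp_j):
    """Metropolis-Hastings criterion for swapping two replicas running at
    different temperatures; it preserves the joint stationary distribution."""
    delta = (1.0 / temp_i - 1.0 / temp_j) * (energy_i - energy_j)
    return np.random.rand() < min(1.0, np.exp(delta))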
This improves training in the sense of having stable best results across different experiments.
Appendix slides
Slide for hypotheses
Noise makes the landscape smoother
Metropolis-Hastings rule exchange