Ukrainian Catholic University
Faculty of Applied Sciences
Data Science Master Program
January 23rd
Abstract. In this project (Glusco and Maksymenko, 2019), we address the Reinforcement Learning problem of exploration vs. exploitation. The problem can be rephrased in terms of generalization and overfitting, or of efficient learning. To tackle it, we combine techniques from several lines of research: we introduce noise as an environment characteristic (Packer et al., 2018); create a setup of multiple Reinforcement Learning agents and environments that train in parallel and interact with each other (Jaderberg et al., 2017); and use a parallel tempering approach that initializes environments with different temperatures (noise levels) and performs exchanges according to the Metropolis-Hastings criterion (Pushkarov et al., 2019). We implemented a multi-agent architecture with parallel tempering based on two different Reinforcement Learning algorithms – Deep Q Network and Advantage Actor-Critic – and an OpenAI Gym (Gym: A toolkit for developing and comparing reinforcement learning algorithms) environment wrapper for noise injection. We used the CartPole environment to run multiple experiments with three different types of exchanges: no exchange, random exchange, and smart exchange according to the Metropolis-Hastings rule. We implemented aggregation functionality to gather the results of all experiments and visualize them with charts for analysis. The experiments showed that a parallel tempering approach with multiple environments at different noise levels can improve the performance of the agent under specific circumstances. At the same time, the results raised new questions that should be addressed to fully understand the behavior of the implemented approach.
4. Introduction to Reinforcement Learning:
● Markov Decision Process
  https://en.wikipedia.org/wiki/Markov_decision_process
● Return
● State-value function
● Action-value function
● Bellman Optimality Equation
6. Related work
● “Assessing Generalization in Deep Reinforcement Learning”
● “Population Based Training of Neural Networks”
● “Training Deep Neural Networks by optimizing over nonlocal paths in hyperparameter space”
12. Evaluation:
A2C, Smart Exchange
● Distance curves are combined
● Training scores with low noise are higher and more stable
● Play scores are a little more stable
13. Evaluation:
DQN, No Exchange
● Inverse distance picture
● High noise dominates the agent
● Decent play scores only for agents with low noise
17. Future work
● Environments with higher complexity
● Different noise types
● Compare explorations (noise vs agent)
● Exploration potential of the environment
● Parallelize training between exchanges
18. Review discussion
C1: No reason to have Bellman’s equation in this work (Chapter 1).
A1: Bellman’s equation is a fundamental equation.
C2: Equation on page 8 lacks any explanation.
A2: Agree. Provide more context or remove the formula.
C3: Page 9 hints at some convergence results, but lacks sufficient explanation and assumptions.
A3: It is a well-known property of the detailed balance condition and Metropolis-Hastings sampling that in the long-time limit we efficiently sample from the underlying stationary distribution.
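For context, the standard statements behind this answer (generic textbook definitions, not specific to this thesis; \pi is the target stationary distribution, P the transition kernel, q the proposal distribution):

\pi(x)\, P(x \to y) = \pi(y)\, P(y \to x) \quad \text{(detailed balance)}

A(x \to y) = \min\!\left( 1,\ \frac{\pi(y)\, q(y \to x)}{\pi(x)\, q(x \to y)} \right) \quad \text{(Metropolis-Hastings acceptance)}

A chain whose proposals are accepted with this probability satisfies detailed balance with respect to \pi, so in the long-time limit it samples from \pi regardless of the initial state.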
19. Review discussion
C4: Why was a multiplicative noise added, and not an additive noise, which seems like a more natural choice for Cart Pole?
A4: Multiplicative noise takes into account the scale of the value.
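For illustration, a minimal sketch of how such multiplicative observation noise could be added as an OpenAI Gym wrapper (the class name, the Gaussian form, and the noise_level parameter are assumptions for this example, not the exact thesis code):

import gym
import numpy as np

class MultiplicativeNoiseWrapper(gym.ObservationWrapper):
    """Scales each observation component by a random factor around 1,
    so the perturbation is proportional to the value's own scale."""
    def __init__(self, env, noise_level=0.1):
        super().__init__(env)
        self.noise_level = noise_level  # plays the role of the replica "temperature"

    def observation(self, obs):
        factor = 1.0 + self.noise_level * np.random.randn(*np.shape(obs))
        return obs * factor

# Usage: env = MultiplicativeNoiseWrapper(gym.make("CartPole-v1"), noise_level=0.05)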
C5: The choice of DQN and A2C seems an overkill and somewhat counterproductive. I would expect that policy methods would be a more natural choice, for instance, cross-entropy or natural evolution strategies.
A5: The focus of the work is on the diffusion picture rather than on the specific agent algorithm.
20. Review discussion
C6: Hypothesis 1: whether noise improves exploration. I am not convinced that the charts shown are enough evidence of the first part. I would have liked to see the trajectories of the cart in the space (x, x’), which corresponds to the first two coordinates.
A6: The distance is not bound to the environment and depends only on the model parameters.
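To make this concrete, a sketch of how such a distance between two agents can be computed purely from their parameters (assuming an L2 norm over flattened weight arrays; the exact metric used in the thesis may differ):

import numpy as np

def parameter_distance(params_a, params_b):
    """L2 distance between two agents' parameter sets (lists of weight arrays).
    It depends only on the models, not on the environment."""
    vec_a = np.concatenate([np.ravel(p) for p in params_a])
    vec_b = np.concatenate([np.ravel(p) for p in params_b])
    return float(np.linalg.norm(vec_a - vec_b))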
C7: Hypothesis 2: replica exchange improves training. Improving by getting a better score is meaningless. The real challenge is to justify the usefulness of a method by showing that it trains faster/cheaper than the competitors.
A7: We used the scores as mean values together with their variance, which shows the stability of the agent. Speed/computation cost is mentioned as future work.
21. Review discussion
C8: The Appendix feels indeed like covering some page quota, charts are hard to read and might even hint at some data leakage issues, as they suspiciously overlap with each other too much.
A8: Show more results for a more complete picture.
C9: There are minor grammar issues (misplaced articles), but nothing serious.
A9:
Hello, my name is Dmitrii Glushko, my supervisor is Dr. Mykola Maksymenko, and I want to present to you my master thesis, which is called Replica Exchange For ...
Here are the contents of my presentation: we will briefly go through an introduction to RL, describe the chosen problem, look at the related work, show the solution and evaluation, draw conclusions, and discuss future work and the review.
So let’s briefly describe the key concepts in the RL:
We have an environment and an agent. We can control only the agent, so that is what we train. The agent interacts with the environment by choosing an available action, observing the environment’s state, and receiving a reward. The main idea is to maximize the obtained reward.
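In OpenAI Gym terms, this interaction loop looks roughly as follows (a minimal sketch with CartPole and random actions, using the classic Gym API; the actual agents in this work are DQN and A2C):

import gym

env = gym.make("CartPole-v1")
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()            # the agent chooses an available action
    obs, reward, done, info = env.step(action)    # observe the new state, receive the reward
    total_reward += reward                        # the goal is to maximize this quantity
env.close()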
Usually an RL environment is described as an MDP: a set of states, actions, transitions between them, and rewards. For the total reward we use the term “Return”. A few more notions here: the policy is the behavior of the agent (what actions it chooses in a given state), the state-value function is the expected return for being in a given state, and the action-value function is the expected return for choosing a specific action in a given state. v* and q* are the optimal value/action-value functions (the ones that maximize the return).
And the foundation of all RL algorithms is the Bellman Optimality Equation for MDPs, which shows us that to obtain optimal values we need to choose the actions that lead us to the optimal value at the next state.
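For reference, these notions in their standard textbook form (generic definitions, reproduced here only for clarity):

G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \quad \text{(return, with discount factor } \gamma\text{)}

v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right], \qquad q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s,\ A_t = a \right]

v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\left[ r + \gamma\, v_*(s') \right] \quad \text{(Bellman Optimality Equation)}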
Statement about exploration: the agent’s behavior of choosing actions in order to find state-action combinations not seen before.
Statement about exploitation: choosing the action that currently looks optimal.
This problem leads to not finding the best agent, and the result depends strongly on the initial state of the agent -> poor stability of agents in RL.
To solve the problem we need to find an optimal balance: efficiently exploring promising parts of the space while still exploiting what has already been learned.
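As a simple illustration of this balance (the textbook epsilon-greedy rule, shown here only as an example of trading off the two behaviors, not as the mechanism studied in this work):

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a random action (exploration),
    otherwise take the action with the highest estimated value (exploitation)."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore: try something possibly unseen
    return int(np.argmax(q_values))               # exploit: use current knowledge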
We reviewed many related papers, but let’s talk about the ones we use to build our own solution.
Noise
PBT: hyperparameter optimization. PBT is a technique for training multiple agents at the same time to find the optimal setup. The technique is based on a genetic algorithm: copy the best agents, or explore by randomly changing parameters.
Hyperparameter optimization: noise flattens the loss function and helps avoid getting stuck in local minima. Hyperparameters are changed via the Metropolis-Hastings exchange rule, which guarantees that in the long-time limit there is no dependence on the initial state (the whole hyperparameter space is explored).
The noise is similar to the “Assessing Generalization” paper, but without using the Sunblaze environments.
The setup of multiple agents and environments training in parallel is similar to PBT, but with a different approach to communication.
We use a Metropolis-Hastings exchange rule similar to “Training Deep Neural Networks by optimizing over nonlocal paths in hyperparameter space”, with an adapted exchange rule.
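A minimal sketch of the standard parallel tempering (replica exchange) acceptance test that such a rule is based on; here “energy” stands for whatever quality measure the adapted rule uses (e.g. a negative score), and temperatures correspond to the environments’ noise levels – the exact adapted rule in the thesis may differ:

import numpy as np

def maybe_swap(energy_i, energy_j, temp_i, temp_j):
    """Metropolis-Hastings criterion for swapping two replicas running at
    different temperatures; it preserves the joint stationary distribution."""
    delta = (1.0 / temp_i - 1.0 / temp_j) * (energy_i - energy_j)
    return np.random.rand() < min(1.0, np.exp(delta))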
This improves training in the sense of having stable best results across different experiments.
Appendix slides
Slide for hypotheses
Noise makes the landscape smoother
Metropolis-Hastings rule exchange