RL Upside-Down:
Training Agents using Upside-Down RL
LEE, DOHYEON
leadh991114@gmail.com
4/16/23 딥논읽 Seminar - Reinforcement Learning
NeurIPS 2019 Workshop
Contents
1. Introduction
2. Methods
3. Experiments
4. Conclusion
| Some Doubts about RL
| “Deep RL Doesn’t Work Yet”
1. Sample Inefficiency
Atari Games
- Rainbow reaches 100% median human-normalized performance only after about 18 million frames
- roughly 83 hours of play experience
2. Nice Alternatives
Optimal Control Theory
- LQR, QP, Convex Optimization
- Model Predictive Control (MPC)
3. Hard Reward Design
The Alignment Problem
- Universe (OpenAI)
- the reward function must capture "exactly" what you want
4. Local Optima
Exploration vs. Exploitation
- HalfCheetah (UC Berkeley, BAIR)
- the dilemma is very hard to resolve, so agents often get stuck in local optima
5. Generalization Issue
Overfitting
- multi-agent Laser Tag
- Lanctot et al., NeurIPS 2017
6. Stability & Reproducibility Problem
HalfCheetah
- Houthooft et al., NIPS 2016
- [Figure: learning outcomes split roughly 75% / 25% across random seeds]
| Why don’t we leverage the advantages of SL?
| Advantages of SL algorithms
1. Simplicity
2. Robustness
3. Scalability
| “In general, there is no way to do this”
1. SL:
- Function: Search
- Assumption: I.I.D./Stationary Condition
- Feedback from Env: Error Signal
2. RL:
- Function: Search & Long-Term Memory
- Assumption: Non-I.I.D./Non-Stationary Condition
- Feedback from Env: Evaluation Signal
| A Trick to convert RL to SL!
| MDP → Supervised Learning: a Classification Problem!
Goal: to maximize returns in expectation
→ to learn to follow commands such as:
- "achieve total reward R in the next T time steps"
- "reach state S in fewer than T time steps"
QnA
1. Idea
How? Turn S → A → R → S → A → R → … into S → R′ → A → S → R′ → A → …
i.e., RL turned upside down (⅂ꓤ): desired rewards become inputs, actions become prediction targets.
1. Idea
Intuitively, this mapping answers the question:
| “if an agent is in a given state and desires a given return over a given
| horizon, which action should it take next based on past experience?”
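A minimal sketch of such a command-conditioned policy, assuming a PyTorch-style feed-forward network (the architecture and names here are illustrative, not the paper's exact setup):

import torch
import torch.nn as nn

class BehaviorFunction(nn.Module):
    """Maps (state, command) to action logits; the command is (desired return, desired horizon)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state, desired_return, desired_horizon):
        # The command enters as extra input features alongside the observation.
        command = torch.stack([desired_return, desired_horizon], dim=-1)
        return self.net(torch.cat([state, command], dim=-1))  # logits over actions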
2. Behavior Function
Notation
- d^r : desired return
- d^h : desired horizon
- S : random variable for the environment's state
- A : random variable for the agent's next action
- R_{d^h} : random variable for the return obtained by the agent during the next d^h time steps
- 𝒯 : set of trajectories
- B_π : policy-based behavior function B_π(a, s, d^r, d^h)
- B_𝒯 : trajectory-based behavior function, used when only a set of trajectories 𝒯 generated
by an unknown policy is available; N_𝒯(s, d^h, d^r) is the number of trajectory segments in 𝒯
that start in state s, have length d^h, and have total reward d^r
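As a sketch (the slide's original equation image is not in this transcript), the trajectory-based behavior function can be written with the counts above as:

% Reconstruction, not the slide's exact formula: fraction of matching segments whose first action is a
B_{\mathcal{T}}(a \mid s, d^r, d^h)
  = \frac{\#\{\text{segments in } \mathcal{T} \text{ starting in } s,\ \text{of length } d^h,\ \text{with total reward } d^r \text{ and first action } a\}}
         {N_{\mathcal{T}}(s, d^h, d^r)}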
2. Behavior Function
| “Using a loss function 𝑳, it can be estimated by solving the following SL problem”
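The referenced SL problem appears on the slide as an equation image; a hedged LaTeX reconstruction from the notation above (stored trajectory segments provide the inputs and action targets):

% Each stored segment from t1 to t2 yields input (s_{t1}, d^r, d^h) with
% d^r = sum of rewards in [t1, t2], d^h = t2 - t1, and target action a_{t1}.
B = \arg\min_{B'} \;
    \mathbb{E}_{(s,\,a,\,d^r,\,d^h) \sim \mathcal{T}}
    \big[\, L\big(B'(s, d^r, d^h),\, a\big) \,\big]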
3. Algorithm
3. Algorithm
| ⅂ꓤ does not explicitly maximize returns, but…
Learning can be biased towards higher returns
by selecting the trajectories on which the behavior function is trained!
| To Do So,
Use a replay buffer with the best 𝑍 trajectories seen so far,
where 𝑍 is a fixed hyperparameter.
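A minimal sketch of such a buffer, assuming a heap-based implementation (illustrative, not the authors' code):

import heapq

class TopZReplayBuffer:
    """Keeps only the Z highest-return episodes seen so far."""
    def __init__(self, z):
        self.z = z
        self.heap = []   # min-heap of (episode_return, counter, episode)
        self._count = 0

    def add(self, episode, episode_return):
        self._count += 1  # tie-breaker so episodes themselves are never compared
        heapq.heappush(self.heap, (episode_return, self._count, episode))
        if len(self.heap) > self.z:
            heapq.heappop(self.heap)  # drop the lowest-return episode

    def episodes(self):
        # Episodes ordered from lowest to highest return.
        return [ep for _, _, ep in sorted(self.heap)]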
3. Algorithm
| At any time t during an episode,
the current behavior function B produces a distribution over actions P(a_t | s_t, c_t) = B(s_t, c_t; θ).
| Given an initial command c_0 for a new episode,
a new trajectory is generated using Algorithm 2 by sampling actions according to B and
updating the current command using the obtained rewards and time left at each time step,
until the episode terminates.
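A minimal sketch of this rollout loop, assuming the classic Gym step/reset API, a discrete action space, and the BehaviorFunction sketch from earlier; the command update (subtract the obtained reward from the desired return, decrement the desired horizon with a floor of one) follows the paper:

import torch

def generate_episode(env, behavior, d_return, d_horizon):
    """Roll out one episode by sampling actions from the behavior function,
    updating the command (desired return, desired horizon) after every step."""
    state, trajectory, done = env.reset(), [], False
    while not done:
        s = torch.as_tensor(state, dtype=torch.float32)
        logits = behavior(s, torch.tensor(float(d_return)), torch.tensor(float(d_horizon)))
        action = torch.distributions.Categorical(logits=logits).sample().item()
        next_state, reward, done, _ = env.step(action)
        trajectory.append((state, action, reward))
        d_return -= reward                 # desired return shrinks by the obtained reward
        d_horizon = max(d_horizon - 1, 1)  # time left, never below one step
        state = next_state
    return trajectory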
3. Algorithm
3. Algorithm
| After each training phase, the agent can be given new commands,
potentially achieving higher returns due to the additional knowledge gained by further training.
| To profit from such exploration through generalization,
a set of new initial commands c_0 to be used in Algorithm 2 is generated:
1. A number of episodes with the highest returns are selected from the replay buffer.
This number is a hyperparameter and remains fixed during training.
2. The exploratory desired horizon d^h_0 is set to the mean of the lengths of the selected episodes.
3. The exploratory desired returns d^r_0 are sampled from the uniform distribution 𝒰[M, M + S],
where M is the mean and S is the standard deviation of the selected episodic returns.
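A minimal sketch of this command-sampling step (the data layout and names are illustrative):

import numpy as np

def exploratory_commands(buffer_episodes, n_best):
    """Derive a new initial command (d^r_0, d^h_0) from the n_best highest-return
    episodes, following steps 1-3 above. Each episode is a dict with 'return' and 'length'."""
    best = sorted(buffer_episodes, key=lambda e: e["return"], reverse=True)[:n_best]
    lengths = np.array([e["length"] for e in best])
    returns = np.array([e["return"] for e in best])
    d_horizon_0 = lengths.mean()                # step 2: mean length of the selected episodes
    M, S = returns.mean(), returns.std()
    d_return_0 = np.random.uniform(M, M + S)    # step 3: sample from U[M, M + S]
    return d_return_0, d_horizon_0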
3. Algorithm
3. Algorithm
Algorithm 2 is also used to evaluate the agent at any time, using evaluation
commands derived from the most recent exploratory commands:
the initial desired return d^r_0 is set to M, the lower bound of the desired
returns from the most recent exploratory command, and the initial desired
horizon d^h_0 is reused.
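Continuing the names from the sketches above, evaluation could then be expressed as (illustrative):

def evaluate(env, behavior, M, d_horizon_0):
    # Evaluation command: desired return = M (lower bound of U[M, M + S]),
    # desired horizon = the most recent exploratory horizon, reused as-is.
    return generate_episode(env, behavior, d_return=M, d_horizon=d_horizon_0)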
QnA
1. Tasks
• Fully-connected feed-forward neural networks, except for
TakeCover-v0, where convolutional networks are used
• Environments with both low- and high-dimensional (visual)
observations, and both discrete and continuous-valued actions:
• LunarLander-v2, based on Box2D
• TakeCover-v0, based on VizDoom
• Swimmer-v2 & InvertedDoublePendulum-v2, based on MuJoCo
2. Results
2. Sparse Delayed-Reward Version
| Since ⅂ꓤ does not use temporal differences for learning,
it is reasonable to hypothesize that it behaves differently from algorithms that do,
especially under delayed rewards. To test this, the environments were converted to sparse,
delayed-reward (partially observable) versions by withholding all rewards until the last step of each episode.
QnA
Conclusion
?
Thank You for Listening!
[Closing figure: relationship among RL, SL, and UL]
RL_UpsideDown

  • 1.
    RL Upside-Down: Training Agentsusing Upside-Down RL LEE, DOHYEON leadh991114@gmail.com 4/16/23 딥논읽 세미나 - 강화학습 NeurIPS 2019 Workshop
  • 2.
    Contents 1. Introduction 2. Methods 3.Experiments 4. Conclusion 4/16/23 딥논읽 세미나 - 강화학습 1
  • 3.
    | Some Doubtsabout RL 4/16/23 딥논읽 세미나 - 강화학습 2 INDEX Introduction Methods Performance Conclusion | “Deep RL Doesn’t Work Yet”
  • 4.
    1. Sample Inefficiency 4/16/23딥논읽 세미나 - 강화학습 3 INDEX Introduction Methods Performance Conclusion Atari Games - 100% Rainbow when about 18 millions of frames; - ~ 83 hours of play experience
  • 5.
    2. Nice Alternatives 4/16/23딥논읽 세미나 - 강화학습 4 INDEX Introduction Methods Performance Conclusion Optimal Control Theory - LQR, QP, Convex Optimization - Model Predictive Control(MPC)
  • 6.
    3. Hard RewardDesign 4/16/23 딥논읽 세미나 - 강화학습 5 INDEX Introduction Methods Performance Conclusion The Alignment Problem - Universe(OpenAI) - reward function must capture ”exactly” what you want
  • 7.
    4. Local Optima 4/16/23딥논읽 세미나 - 강화학습 6 INDEX Introduction Methods Performance Conclusion Exploration vs Exploitation - HalfCheetah(UC Bekeley, BAIR) - The dilemma is too hard to solve
  • 8.
    5. Generalization Issue 4/16/23딥논읽 세미나 - 강화학습 7 INDEX Introduction Methods Performance Conclusion Overfitting - Laser Tag for Multi-agent - Lanctot et al, NeurIPS 2017
  • 9.
    6. Stability &Reproducibility Problem 4/16/23 딥논읽 세미나 - 강화학습 8 INDEX Introduction Methods Performance Conclusion HalfCheetah - Houthooft et al, NIPS 2016 75% 25%
  • 10.
    | Why don’twe leverage the advantages of SL? 4/16/23 딥논읽 세미나 - 강화학습 9 INDEX Introduction Methods Performance Conclusion | Advantages of SL algorithms 1. Simplicity 2. Robustness 3. Scalability | “In general, there is no way to do this” 1. SL: - Function: Search - Assumption: I.I.D./Stationary Condition - Feedback from Env: Error Signal 2. RL: - Function: Search & Long-Term Memory - Assumption: Non-I.I.D./Non-Stationary Condition - Feedback from Env: Evaluation Signal
  • 11.
    | A Trickto convert RL to SL! 4/16/23 딥논읽 세미나 - 강화학습 10 INDEX Introduction Methods Performance Conclusion | MDP → Supervised Learning; Classification Problem! Goal: To maximize returns in expectation → To learn to follow commands such as; - “achieve total reward R in next T time steps” - “reach state S in fewer than T time steps”.
  • 12.
    QnA 4/16/23 딥논읽 세미나- 강화학습 11 INDEX Introduction Methods Performance Conclusion
  • 13.
    1. Idea 4/16/23 딥논읽세미나 - 강화학습 12 INDEX Introduction Methods Performance Conclusion How? S →A →R → S →A →R → … into S → R’ → A → S → R’ → A → … RL!
  • 14.
    1. Idea 4/16/23 딥논읽세미나 - 강화학습 13 INDEX Introduction Methods Performance Conclusion Intuitively, it answers the question: | “if an agent is in a given state and desires a given return over a given | horizon, which action should it take next based on past experience?”
  • 15.
    2. Behavior Function 4/16/23딥논읽 세미나 - 강화학습 14 INDEX Introduction Methods Performance Conclusion Notation - 𝑑! : desired return - 𝑑" : desired horizon - 𝑆 : random variable for environment’s state - 𝐴 : random variable for the agent’s next action - 𝑅#! : random variable for the return obtained by the agent during the next 𝑑" time steps. - 𝒯 : set of trajectories - 𝐵$ : policy-based behavior function 𝐵!(𝑎, 𝑠, 𝑑", 𝑑#) - 𝐵𝓣 : trajectory-based behavior function using an unknown policy is available, where where 𝑁# $ (𝑠, 𝑑", 𝑑#) is the number of trajectory segments in 𝒯 that start in state 𝑠, have length 𝑑# and total reward 𝑑"
  • 16.
    2. Behavior Function 4/16/23딥논읽 세미나 - 강화학습 15 INDEX Introduction Methods Performance Conclusion | “Using a loss function 𝑳, it can be estimated by solving the following SL problem”
  • 17.
    3. Algorithm 4/16/23 딥논읽세미나 - 강화학습 16 INDEX Introduction Methods Performance Conclusion
  • 18.
    3. Algorithm 4/16/23 딥논읽세미나 - 강화학습 17 INDEX Introduction Methods Performance Conclusion | ⅂ꓤ does not explicitly maximize returns, but… Learning can be biased towards higher returns by selecting the trajectories on which the behavior function is trained! | To Do So, Use a replay buffer with the best 𝑍 trajectories seen so far, where 𝑍 is a fixed hyperparameter.
  • 19.
    3. Algorithm 4/16/23 딥논읽세미나 - 강화학습 18 INDEX Introduction Methods Performance Conclusion | At any time 𝑡 during an episode, the current behavior function 𝐵 produces a distribution over actions 𝑃 𝑎& 𝑠&, 𝑐& = 𝐵(𝑠&, 𝑐&; 𝜃) | Given an initial command 𝑐' for a new episode, a new trajectory is generated using Algorithm 2 by sampling actions according to B and updating the current command using the obtained rewards and time left at each time step until the episode terminates.
  • 20.
    3. Algorithm 4/16/23 딥논읽세미나 - 강화학습 19 INDEX Introduction Methods Performance Conclusion
  • 21.
    3. Algorithm 4/16/23 딥논읽세미나 - 강화학습 20 INDEX Introduction Methods Performance Conclusion | After each training phase the agent can be given new commands, potentially achieving higher returns due to additional knowledge gained by further training. | To profit from such exploration through generalization, a set of new initial commands c0 to be used in Algorithm 2 is generated. 1. A number of episodes with the highest returns are selected from the replay buffer. This number is a hyperparameter and remains fixed during training. 2. The exploratory desired horizon 𝑑! " is set to the mean of the lengths of the selected episodes. 3. The exploratory desired returns 𝑑! # are sampled from the uniform distribution 𝒰[𝑀, 𝑀 + 𝑆] where 𝑀 is the mean and 𝑆 is the standard deviation of the selected episodic returns.
  • 22.
    3. Algorithm 4/16/23 딥논읽세미나 - 강화학습 21 INDEX Introduction Methods Performance Conclusion
  • 23.
    3. Algorithm 4/16/23 딥논읽세미나 - 강화학습 22 INDEX Introduction Methods Performance Conclusion Algorithm 2 is also used to evaluate the agent at any time using evaluation commands derived from the most recent exploratory commands. The initial desired return 𝑑' ! is set to 𝑀 , the lower bound of the desired returns from the most recent exploratory command, and the initial desired horizon 𝑑' " is reused.
  • 24.
    QnA 4/16/23 딥논읽 세미나- 강화학습 23 INDEX Introduction Methods Performance Conclusion
  • 25.
    1. Tasks 4/16/23 딥논읽세미나 - 강화학습 24 INDEX Introduction Methods Performance Conclusion • Fully-connected feed-forward neural networks, except for TakeCover-v0 where we used convolutional networks • Use environments with both low and high-dimensional (visual) observations, and both discrete and continuous-valued actions: • LunarLander-v2 based on Box2D • TakeCover-v0 based on VizDoom • Swimmer-v2 & InvertedDoublePendulum-v2 based on the MuJoCo
  • 26.
    2. Results 4/16/23 딥논읽세미나 - 강화학습 25 INDEX Introduction Methods Performance Conclusion
  • 27.
    2. ver.Sparse 4/16/23 딥논읽세미나 - 강화학습 26 INDEX Introduction Methods Performance Conclusion | Since ⅂ꓤ does not use temporal differences for learning, it is reasonable to hypothesize that its behavior may change differently from other algorithms that do. To test this, we converted environments to their sparse, delayed reward (partially observable) versions by delaying all rewards until the last step of each episode.
  • 28.
    QnA 4/16/23 딥논읽 세미나- 강화학습 27 INDEX Introduction Methods Performance Conclusion
  • 29.
    Conclusion 4/16/23 딥논읽 세미나- 강화학습 28 INDEX Introduction Methods Performance Conclusion
  • 30.
    ? 4/16/23 딥논읽 세미나- 강화학습 29 INDEX Introduction Methods Performance Conclusion
  • 31.
    Thank You ForYour Listening! 4/16/23 딥논읽 세미나 - 강화학습 30 INDEX Introduction Methods Performance Conclusion RL SL UL