RL Upside-Down:
Training Agents using Upside-Down RL
LEE, DOHYEON
leadh991114@gmail.com
4/16/23 딥논읽 Seminar - Reinforcement Learning
NeurIPS 2019 Workshop
Contents
1. Introduction
2. Methods
3. Experiments
4. Conclusion
| Some Doubts about RL
| “Deep RL Doesn’t Work Yet”
1. Sample Inefficiency
Atari Games
- Rainbow reaches 100% median human-normalized performance only after about 18 million frames
- roughly 83 hours of play experience
2. Nice Alternatives
Optimal Control Theory
- LQR, QP, Convex Optimization
- Model Predictive Control (MPC)
3. Hard Reward Design
The Alignment Problem
- Universe (OpenAI)
- the reward function must capture "exactly" what you want
4. Local Optima
Exploration vs. Exploitation
- HalfCheetah (UC Berkeley, BAIR)
- the dilemma is very hard to resolve, so agents often get stuck in local optima
5. Generalization Issue
Overfitting
- multi-agent Laser Tag
- Lanctot et al., NeurIPS 2017
6. Stability & Reproducibility Problem
HalfCheetah
- Houthooft et al., NIPS 2016
- [Figure: learning outcomes split roughly 75% / 25% across random seeds]
| Why don’t we leverage the advantages of SL?
| Advantages of SL algorithms
1. Simplicity
2. Robustness
3. Scalability
| “In general, there is no way to do this”
1. SL:
- Function: Search
- Assumption: I.I.D./Stationary Condition
- Feedback from Env: Error Signal
2. RL:
- Function: Search & Long-Term Memory
- Assumption: Non-I.I.D./Non-Stationary Condition
- Feedback from Env: Evaluation Signal
| A Trick to convert RL to SL!
| MDP → Supervised Learning: a Classification Problem!
Goal: to maximize returns in expectation
→ to learn to follow commands such as:
- "achieve total reward R in the next T time steps"
- "reach state S in fewer than T time steps"
QnA
1. Idea
How? Turn S → A → R → S → A → R → … into S → R′ → A → S → R′ → A → …
i.e., RL turned upside down (⅂ꓤ): desired rewards become inputs, actions become prediction targets.
1. Idea
Intuitively, this mapping answers the question:
| “if an agent is in a given state and desires a given return over a given
| horizon, which action should it take next based on past experience?”
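A minimal sketch of such a command-conditioned policy, assuming a PyTorch-style feed-forward network (the architecture and names here are illustrative, not the paper's exact setup):

import torch
import torch.nn as nn

class BehaviorFunction(nn.Module):
    """Maps (state, command) to action logits; the command is (desired return, desired horizon)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state, desired_return, desired_horizon):
        # The command enters as extra input features alongside the observation.
        command = torch.stack([desired_return, desired_horizon], dim=-1)
        return self.net(torch.cat([state, command], dim=-1))  # logits over actions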
2. Behavior Function
Notation
- d^r : desired return
- d^h : desired horizon
- S : random variable for the environment's state
- A : random variable for the agent's next action
- R_{d^h} : random variable for the return obtained by the agent during the next d^h time steps
- 𝒯 : set of trajectories
- B_π : policy-based behavior function B_π(a, s, d^r, d^h)
- B_𝒯 : trajectory-based behavior function, used when only a set of trajectories 𝒯 generated
by an unknown policy is available; N_𝒯(s, d^h, d^r) is the number of trajectory segments in 𝒯
that start in state s, have length d^h, and have total reward d^r
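As a sketch (the slide's original equation image is not in this transcript), the trajectory-based behavior function can be written with the counts above as:

% Reconstruction, not the slide's exact formula: fraction of matching segments whose first action is a
B_{\mathcal{T}}(a \mid s, d^r, d^h)
  = \frac{\#\{\text{segments in } \mathcal{T} \text{ starting in } s,\ \text{of length } d^h,\ \text{with total reward } d^r \text{ and first action } a\}}
         {N_{\mathcal{T}}(s, d^h, d^r)}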
2. Behavior Function
| “Using a loss function 𝑳, it can be estimated by solving the following SL problem”
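The referenced SL problem appears on the slide as an equation image; a hedged LaTeX reconstruction from the notation above (stored trajectory segments provide the inputs and action targets):

% Each stored segment from t1 to t2 yields input (s_{t1}, d^r, d^h) with
% d^r = sum of rewards in [t1, t2], d^h = t2 - t1, and target action a_{t1}.
B = \arg\min_{B'} \;
    \mathbb{E}_{(s,\,a,\,d^r,\,d^h) \sim \mathcal{T}}
    \big[\, L\big(B'(s, d^r, d^h),\, a\big) \,\big]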
3. Algorithm
3. Algorithm
| ⅂ꓤ does not explicitly maximize returns, but…
Learning can be biased towards higher returns
by selecting the trajectories on which the behavior function is trained!
| To Do So,
Use a replay buffer with the best 𝑍 trajectories seen so far,
where 𝑍 is a fixed hyperparameter.
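A minimal sketch of such a buffer, assuming a heap-based implementation (illustrative, not the authors' code):

import heapq

class TopZReplayBuffer:
    """Keeps only the Z highest-return episodes seen so far."""
    def __init__(self, z):
        self.z = z
        self.heap = []   # min-heap of (episode_return, counter, episode)
        self._count = 0

    def add(self, episode, episode_return):
        self._count += 1  # tie-breaker so episodes themselves are never compared
        heapq.heappush(self.heap, (episode_return, self._count, episode))
        if len(self.heap) > self.z:
            heapq.heappop(self.heap)  # drop the lowest-return episode

    def episodes(self):
        # Episodes ordered from lowest to highest return.
        return [ep for _, _, ep in sorted(self.heap)]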
3. Algorithm
| At any time t during an episode,
the current behavior function B produces a distribution over actions P(a_t | s_t, c_t) = B(s_t, c_t; θ).
| Given an initial command c_0 for a new episode,
a new trajectory is generated using Algorithm 2 by sampling actions according to B and
updating the current command using the obtained rewards and time left at each time step,
until the episode terminates.
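A minimal sketch of this rollout loop, assuming the classic Gym step/reset API, a discrete action space, and the BehaviorFunction sketch from earlier; the command update (subtract the obtained reward from the desired return, decrement the desired horizon with a floor of one) follows the paper:

import torch

def generate_episode(env, behavior, d_return, d_horizon):
    """Roll out one episode by sampling actions from the behavior function,
    updating the command (desired return, desired horizon) after every step."""
    state, trajectory, done = env.reset(), [], False
    while not done:
        s = torch.as_tensor(state, dtype=torch.float32)
        logits = behavior(s, torch.tensor(float(d_return)), torch.tensor(float(d_horizon)))
        action = torch.distributions.Categorical(logits=logits).sample().item()
        next_state, reward, done, _ = env.step(action)
        trajectory.append((state, action, reward))
        d_return -= reward                 # desired return shrinks by the obtained reward
        d_horizon = max(d_horizon - 1, 1)  # time left, never below one step
        state = next_state
    return trajectory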
3. Algorithm
3. Algorithm
| After each training phase, the agent can be given new commands,
potentially achieving higher returns due to the additional knowledge gained by further training.
| To profit from such exploration through generalization,
a set of new initial commands c_0 to be used in Algorithm 2 is generated:
1. A number of episodes with the highest returns are selected from the replay buffer.
This number is a hyperparameter and remains fixed during training.
2. The exploratory desired horizon d^h_0 is set to the mean of the lengths of the selected episodes.
3. The exploratory desired returns d^r_0 are sampled from the uniform distribution 𝒰[M, M + S],
where M is the mean and S is the standard deviation of the selected episodic returns.
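A minimal sketch of this command-sampling step (the data layout and names are illustrative):

import numpy as np

def exploratory_commands(buffer_episodes, n_best):
    """Derive a new initial command (d^r_0, d^h_0) from the n_best highest-return
    episodes, following steps 1-3 above. Each episode is a dict with 'return' and 'length'."""
    best = sorted(buffer_episodes, key=lambda e: e["return"], reverse=True)[:n_best]
    lengths = np.array([e["length"] for e in best])
    returns = np.array([e["return"] for e in best])
    d_horizon_0 = lengths.mean()                # step 2: mean length of the selected episodes
    M, S = returns.mean(), returns.std()
    d_return_0 = np.random.uniform(M, M + S)    # step 3: sample from U[M, M + S]
    return d_return_0, d_horizon_0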
3. Algorithm
3. Algorithm
Algorithm 2 is also used to evaluate the agent at any time, using evaluation
commands derived from the most recent exploratory commands:
the initial desired return d^r_0 is set to M, the lower bound of the desired
returns from the most recent exploratory command, and the initial desired
horizon d^h_0 is reused.
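Continuing the names from the sketches above, evaluation could then be expressed as (illustrative):

def evaluate(env, behavior, M, d_horizon_0):
    # Evaluation command: desired return = M (lower bound of U[M, M + S]),
    # desired horizon = the most recent exploratory horizon, reused as-is.
    return generate_episode(env, behavior, d_return=M, d_horizon=d_horizon_0)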
QnA
1. Tasks
• Fully-connected feed-forward neural networks, except for
TakeCover-v0, where convolutional networks are used
• Environments with both low- and high-dimensional (visual)
observations, and both discrete and continuous-valued actions:
• LunarLander-v2, based on Box2D
• TakeCover-v0, based on VizDoom
• Swimmer-v2 & InvertedDoublePendulum-v2, based on MuJoCo
2. Results
2. Sparse Delayed-Reward Version
| Since ⅂ꓤ does not use temporal differences for learning,
it is reasonable to hypothesize that it behaves differently from algorithms that do,
especially under delayed rewards. To test this, the environments were converted to sparse,
delayed-reward (partially observable) versions by withholding all rewards until the last step of each episode.
QnA
Conclusion
?
Thank You for Listening!
[Closing figure: relationship among RL, SL, and UL]
RL_UpsideDown

  • 1.
    RL Upside-Down: Training Agentsusing Upside-Down RL LEE, DOHYEON leadh991114@gmail.com 4/16/23 딥논읽 세미나 - 강화학습 NeurIPS 2019 Workshop
  • 2.
    Contents 1. Introduction 2. Methods 3.Experiments 4. Conclusion 4/16/23 딥논읽 세미나 - 강화학습 1
  • 3.
    | Some Doubtsabout RL 4/16/23 딥논읽 세미나 - 강화학습 2 INDEX Introduction Methods Performance Conclusion | “Deep RL Doesn’t Work Yet”
  • 4.
    1. Sample Inefficiency 4/16/23딥논읽 세미나 - 강화학습 3 INDEX Introduction Methods Performance Conclusion Atari Games - 100% Rainbow when about 18 millions of frames; - ~ 83 hours of play experience
  • 5.
    2. Nice Alternatives 4/16/23딥논읽 세미나 - 강화학습 4 INDEX Introduction Methods Performance Conclusion Optimal Control Theory - LQR, QP, Convex Optimization - Model Predictive Control(MPC)
  • 6.
    3. Hard RewardDesign 4/16/23 딥논읽 세미나 - 강화학습 5 INDEX Introduction Methods Performance Conclusion The Alignment Problem - Universe(OpenAI) - reward function must capture ”exactly” what you want
  • 7.
    4. Local Optima 4/16/23딥논읽 세미나 - 강화학습 6 INDEX Introduction Methods Performance Conclusion Exploration vs Exploitation - HalfCheetah(UC Bekeley, BAIR) - The dilemma is too hard to solve
  • 8.
    5. Generalization Issue 4/16/23딥논읽 세미나 - 강화학습 7 INDEX Introduction Methods Performance Conclusion Overfitting - Laser Tag for Multi-agent - Lanctot et al, NeurIPS 2017
  • 9.
    6. Stability &Reproducibility Problem 4/16/23 딥논읽 세미나 - 강화학습 8 INDEX Introduction Methods Performance Conclusion HalfCheetah - Houthooft et al, NIPS 2016 75% 25%
  • 10.
    | Why don’twe leverage the advantages of SL? 4/16/23 딥논읽 세미나 - 강화학습 9 INDEX Introduction Methods Performance Conclusion | Advantages of SL algorithms 1. Simplicity 2. Robustness 3. Scalability | “In general, there is no way to do this” 1. SL: - Function: Search - Assumption: I.I.D./Stationary Condition - Feedback from Env: Error Signal 2. RL: - Function: Search & Long-Term Memory - Assumption: Non-I.I.D./Non-Stationary Condition - Feedback from Env: Evaluation Signal
  • 11.
    | A Trickto convert RL to SL! 4/16/23 딥논읽 세미나 - 강화학습 10 INDEX Introduction Methods Performance Conclusion | MDP → Supervised Learning; Classification Problem! Goal: To maximize returns in expectation → To learn to follow commands such as; - “achieve total reward R in next T time steps” - “reach state S in fewer than T time steps”.
  • 12.
    QnA 4/16/23 딥논읽 세미나- 강화학습 11 INDEX Introduction Methods Performance Conclusion
  • 13.
    1. Idea 4/16/23 딥논읽세미나 - 강화학습 12 INDEX Introduction Methods Performance Conclusion How? S →A →R → S →A →R → … into S → R’ → A → S → R’ → A → … RL!
  • 14.
    1. Idea 4/16/23 딥논읽세미나 - 강화학습 13 INDEX Introduction Methods Performance Conclusion Intuitively, it answers the question: | “if an agent is in a given state and desires a given return over a given | horizon, which action should it take next based on past experience?”
  • 15.
    2. Behavior Function 4/16/23딥논읽 세미나 - 강화학습 14 INDEX Introduction Methods Performance Conclusion Notation - 𝑑! : desired return - 𝑑" : desired horizon - 𝑆 : random variable for environment’s state - 𝐴 : random variable for the agent’s next action - 𝑅#! : random variable for the return obtained by the agent during the next 𝑑" time steps. - 𝒯 : set of trajectories - 𝐵$ : policy-based behavior function 𝐵!(𝑎, 𝑠, 𝑑", 𝑑#) - 𝐵𝓣 : trajectory-based behavior function using an unknown policy is available, where where 𝑁# $ (𝑠, 𝑑", 𝑑#) is the number of trajectory segments in 𝒯 that start in state 𝑠, have length 𝑑# and total reward 𝑑"
  • 16.
    2. Behavior Function 4/16/23딥논읽 세미나 - 강화학습 15 INDEX Introduction Methods Performance Conclusion | “Using a loss function 𝑳, it can be estimated by solving the following SL problem”
  • 17.
    3. Algorithm 4/16/23 딥논읽세미나 - 강화학습 16 INDEX Introduction Methods Performance Conclusion
  • 18.
    3. Algorithm 4/16/23 딥논읽세미나 - 강화학습 17 INDEX Introduction Methods Performance Conclusion | ⅂ꓤ does not explicitly maximize returns, but… Learning can be biased towards higher returns by selecting the trajectories on which the behavior function is trained! | To Do So, Use a replay buffer with the best 𝑍 trajectories seen so far, where 𝑍 is a fixed hyperparameter.
  • 19.
    3. Algorithm 4/16/23 딥논읽세미나 - 강화학습 18 INDEX Introduction Methods Performance Conclusion | At any time 𝑡 during an episode, the current behavior function 𝐵 produces a distribution over actions 𝑃 𝑎& 𝑠&, 𝑐& = 𝐵(𝑠&, 𝑐&; 𝜃) | Given an initial command 𝑐' for a new episode, a new trajectory is generated using Algorithm 2 by sampling actions according to B and updating the current command using the obtained rewards and time left at each time step until the episode terminates.
  • 20.
    3. Algorithm 4/16/23 딥논읽세미나 - 강화학습 19 INDEX Introduction Methods Performance Conclusion
  • 21.
    3. Algorithm 4/16/23 딥논읽세미나 - 강화학습 20 INDEX Introduction Methods Performance Conclusion | After each training phase the agent can be given new commands, potentially achieving higher returns due to additional knowledge gained by further training. | To profit from such exploration through generalization, a set of new initial commands c0 to be used in Algorithm 2 is generated. 1. A number of episodes with the highest returns are selected from the replay buffer. This number is a hyperparameter and remains fixed during training. 2. The exploratory desired horizon 𝑑! " is set to the mean of the lengths of the selected episodes. 3. The exploratory desired returns 𝑑! # are sampled from the uniform distribution 𝒰[𝑀, 𝑀 + 𝑆] where 𝑀 is the mean and 𝑆 is the standard deviation of the selected episodic returns.
  • 22.
    3. Algorithm 4/16/23 딥논읽세미나 - 강화학습 21 INDEX Introduction Methods Performance Conclusion
  • 23.
    3. Algorithm 4/16/23 딥논읽세미나 - 강화학습 22 INDEX Introduction Methods Performance Conclusion Algorithm 2 is also used to evaluate the agent at any time using evaluation commands derived from the most recent exploratory commands. The initial desired return 𝑑' ! is set to 𝑀 , the lower bound of the desired returns from the most recent exploratory command, and the initial desired horizon 𝑑' " is reused.
  • 24.
    QnA 4/16/23 딥논읽 세미나- 강화학습 23 INDEX Introduction Methods Performance Conclusion
  • 25.
    1. Tasks 4/16/23 딥논읽세미나 - 강화학습 24 INDEX Introduction Methods Performance Conclusion • Fully-connected feed-forward neural networks, except for TakeCover-v0 where we used convolutional networks • Use environments with both low and high-dimensional (visual) observations, and both discrete and continuous-valued actions: • LunarLander-v2 based on Box2D • TakeCover-v0 based on VizDoom • Swimmer-v2 & InvertedDoublePendulum-v2 based on the MuJoCo
  • 26.
    2. Results 4/16/23 딥논읽세미나 - 강화학습 25 INDEX Introduction Methods Performance Conclusion
  • 27.
    2. ver.Sparse 4/16/23 딥논읽세미나 - 강화학습 26 INDEX Introduction Methods Performance Conclusion | Since ⅂ꓤ does not use temporal differences for learning, it is reasonable to hypothesize that its behavior may change differently from other algorithms that do. To test this, we converted environments to their sparse, delayed reward (partially observable) versions by delaying all rewards until the last step of each episode.
  • 28.
    QnA 4/16/23 딥논읽 세미나- 강화학습 27 INDEX Introduction Methods Performance Conclusion
  • 29.
    Conclusion 4/16/23 딥논읽 세미나- 강화학습 28 INDEX Introduction Methods Performance Conclusion
  • 30.
    ? 4/16/23 딥논읽 세미나- 강화학습 29 INDEX Introduction Methods Performance Conclusion
  • 31.
    Thank You ForYour Listening! 4/16/23 딥논읽 세미나 - 강화학습 30 INDEX Introduction Methods Performance Conclusion RL SL UL