Does Zero-Shot Reinforcement Learning Exist?
백승언, 김현성, 이도현, 정강민
11 June, 2023
▪ Introduction
  ▪ Current Success of AI
  ▪ Meta Reinforcement Learning
▪ Does Zero-Shot Reinforcement Learning Exist?
  ▪ Backgrounds
  ▪ Previous Strategies for Zero-Shot RL
  ▪ Algorithms for SF and FB Representations
▪ Experiments
  ▪ Environments
  ▪ Results
Contents
Introduction
▪ Problem setting of reinforcement learning and meta-learning
  ▪ Reinforcement learning
    • Given a certain MDP, learn a policy π that maximizes the expected discounted return 𝔼_{π,p_0}[Σ_{t≥0} γ^t r(s_t, a_t, s_{t+1})]
  ▪ Meta-learning
    • Given data from tasks 𝒯_1, …, 𝒯_N, quickly solve a new task 𝒯_test
▪ Problem setting of Meta-Reinforcement Learning (Meta-RL)
  ▪ Setting 1: meta-learning with diverse goals (goal as a task)
    • 𝒯_i ≜ {𝒮, 𝒜, p(s_0), p(s′|s, a), r(s, a, g), g_i}
  ▪ Setting 2: meta-learning with RL tasks (MDP as a task)
    • 𝒯_i ≜ {𝒮_i, 𝒜_i, p_i(s_0), p_i(s′|s, a), r_i(s, a)}
Meta Reinforcement Learning
Meta RL problem statement in CS-330 (Finn)
Does Zero-Shot Reinforcement Learning Exist?
▪ Notation
  ▪ Reward-free MDP
    • ℳ = (S, A, P, γ) is a reward-free Markov Decision Process (MDP) with state space S, action space A, transition probability P(s′|s, a) from state s to s′ given action a, and discount factor 0 < γ < 1
▪ Problem statement
  ▪ The goal of zero-shot RL is to compute a compact representation ℰ of the environment by observing samples of reward-free transitions (s_t, a_t, s_{t+1}) in this environment
  ▪ Once a reward function is specified later, the agent must use ℰ to immediately produce a good policy, via only elementary computation, without any further planning or learning
  ▪ Reward functions may be specified at test time either as a relatively small set of reward samples (s_i, r_i), or as an explicit function s → r(s)
Backgrounds (I) – Defining Zero-Shot RL
▪ Successor representations (SR)
  ▪ For a finite MDP, the successor representation M^π(s_0, a_0) of a state-action pair (s_0, a_0) under a policy π is defined as the discounted sum of future occurrences of each state
    • M^π(s_0, a_0, s) ≔ 𝔼[Σ_{t≥0} γ^t 𝕀{s_{t+1} = s} | s_0, a_0, π], ∀s ∈ S
  ▪ In matrix form, SRs can be written as M^π = P Σ_{t≥0} γ^t P_π^t = P(I − γP_π)^{−1}, where P_π is the state transition matrix induced by π
    • M^π satisfies the matrix Bellman equation M^π = P + γP_π M^π, and the Q-function can be expressed as Q_r^π = M^π r (see the sketch below)
▪ Successor features (SFs)
  ▪ Successor features extend SRs to continuous MDPs by first assuming we are given a basic feature map φ: S → ℝ^d that embeds states into d-dimensional space, and defining the expected discounted sum of future state features
    • ψ^π(s_0, a_0) ≔ 𝔼[Σ_{t≥0} γ^t φ(s_{t+1}) | s_0, a_0, π]
▪ Successor measures (SMs)
  ▪ Successor measures extend SRs to continuous spaces by treating the distribution of future visited states as a measure M^π over the state space S
    • M^π(s_0, a_0)(X) ≔ Σ_{t≥0} γ^t Pr(s_{t+1} ∈ X | s_0, a_0, π), ∀X ⊂ S, so that ψ^π(s_0, a_0) = ∫_{s′} M^π(s_0, a_0, ds′) φ(s′)
Backgrounds (II) – Important Concept
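As a concrete illustration of the finite-MDP case above, here is a minimal NumPy sketch that builds the successor representation M^π = P(I − γP_π)^{−1} and reads off Q_r^π = M^π r. The tiny random MDP, its sizes, and all variable names are hypothetical; the sketch only makes the matrix shapes explicit.

```python
import numpy as np

# Hypothetical tiny MDP: 4 states, 2 actions (illustration only)
n_s, n_a, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=n_s * n_a)    # P[(s,a), s'] transition matrix
pi = rng.dirichlet(np.ones(n_a), size=n_s)         # stochastic policy pi(a|s)

# State-to-state transition matrix induced by pi: P_pi[s, s'] = sum_a pi(a|s) P[(s,a), s']
P_pi = np.einsum('sa,sap->sp', pi, P.reshape(n_s, n_a, n_s))

# Successor representation M^pi = P (I - gamma * P_pi)^{-1}, shape (n_s * n_a, n_s)
M = P @ np.linalg.inv(np.eye(n_s) - gamma * P_pi)

# Any reward vector r over states then gives the Q-function of pi: Q_r^pi = M^pi r
r = rng.normal(size=n_s)
Q = (M @ r).reshape(n_s, n_a)
```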
▪ Zero-shot RL from successor features
  ▪ Given a basic feature map φ: S → ℝ^d to be learned via another criterion, universal SFs learn the successor features of a particular family of policies π_z for z ∈ ℝ^d,
    • ψ(s_0, a_0, z) = 𝔼[Σ_{t≥0} γ^t φ(s_{t+1}) | s_0, a_0, π_z], with π_z(s) ≔ argmax_a ψ(s, a, z)^T z
  ▪ Once a reward function r is revealed, a few reward samples or explicit knowledge of the function r are used to perform a linear regression of r onto the features φ (see the sketch below)
    • Namely, z_r ≔ argmin_z 𝔼_{s~ρ}[(r(s) − φ(s)^T z)^2] = 𝔼_ρ[φφ^T]^{−1} 𝔼_ρ[φr], and then the policy π_{z_r} is returned
    • This policy is guaranteed to be optimal for all rewards in the linear span of the features φ
      – If r(s) = φ(s)^T w, ∀s ∈ S, then z_r = w, and π_{z_r} is the optimal policy for reward r
▪ Zero-shot RL from forward-backward representations (FB)
  ▪ Forward-backward representations look for F: S × A × ℝ^d → ℝ^d and B: S → ℝ^d such that the long-term transition probabilities M^{π_z} decompose as
    • M^{π_z}(s_0, a_0, ds′) ≈ F(s_0, a_0, z)^T B(s′) ρ(ds′), with π_z(s) ≔ argmax_a F(s, a, z)^T z
    • In a finite space, M^{π_z} can be decomposed as M^{π_z} = F_z^T B diag(ρ)
  ▪ Once a reward function r is revealed, z_r ≔ 𝔼_{s~ρ}[r(s) B(s)] is estimated from a few reward samples or from explicit knowledge of the function r (e.g. z_r = B(s) for the reward of reaching s)
    • Then the policy π_{z_r} is returned
    • For any reward function r, the policy π_{z_r} is optimal for r, with optimal Q-function Q_r^⋆ = F(s, a, z_r)^T z_r
Previous Strategies for Zero-Shot RL
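A minimal sketch of the two task-inference rules above, assuming pretrained callables phi and B (and psi or F for acting) are already available; the function names, shapes, and the greedy action loop over a candidate action set are illustrative assumptions, not the authors' API.

```python
import torch

def sf_task_inference(phi, reward_samples):
    """Linear regression of r onto features: z_r = E[phi phi^T]^{-1} E[phi r]."""
    states, rewards = reward_samples                    # (n, s_dim), (n,)
    feats = phi(states)                                 # (n, d)
    A = feats.T @ feats / len(states)                   # E_rho[phi phi^T]
    b = feats.T @ rewards / len(states)                 # E_rho[phi r]
    return torch.linalg.solve(A, b)                     # z_r

def fb_task_inference(B, reward_samples):
    """FB rule: z_r = E_{s~rho}[r(s) B(s)], estimated from reward samples."""
    states, rewards = reward_samples
    return (rewards.unsqueeze(-1) * B(states)).mean(0)  # (d,)

def act(F_or_psi, z_r, state, actions):
    """pi_{z_r}(s) = argmax_a F(s, a, z_r)^T z_r (or psi in the SF case)."""
    scores = torch.stack([F_or_psi(state, a, z_r) @ z_r for a in actions])
    return actions[int(scores.argmax())]
```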
▪ The authors propose novel losses to train ψ in SFs, and F, B in FB
  ▪ To obtain a full zero-shot RL algorithm, SFs must also specify the basic features φ, so the authors propose ten possible choices based on existing or new representations for RL
▪ Learning the SF ψ^T z instead of ψ
  ▪ The successor features ψ satisfy the Bellman equation ψ^π = Pφ + γP_π ψ^π, a collection of ordinary Bellman equations, one for each component of φ
  ▪ Therefore ψ(s, a, z) for each z could be trained by minimizing the vector-valued Bellman residual
    • ‖ψ(s_t, a_t, z) − φ(s_{t+1}) − γψ(s_{t+1}, π_z(s_{t+1}), z)‖^2, with z sampled at random during training
  ▪ Instead of the vector-valued Bellman residual above, they propose a novel loss (see the sketch below)
    • ℒ(ψ) ≔ 𝔼_{(s_t,a_t,s_{t+1})~ρ}[(ψ(s_t, a_t, z)^T z − φ(s_{t+1})^T z − γψ(s_{t+1}, π_z(s_{t+1}), z)^T z)^2] for each z
  ▪ This trains ψ(·, z)^T z as the Q-function of reward φ^T z, the only case needed, while training the full vector ψ(·, z) amounts to training the Q-functions of the policy π_z for all rewards φ^T z′, for all z′ ∈ ℝ^d including z′ ≠ z
Algorithms for SF and FB Representations (I)
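A sketch of the proposed scalar loss ℒ(ψ), assuming callables psi, phi and a policy pi_z; the frozen target copy psi_target is a standard stabilization choice added by this sketch, and all names and shapes are assumptions.

```python
import torch

def sf_projected_loss(psi, psi_target, phi, pi_z, batch, z, gamma=0.98):
    """Scalar Bellman residual on psi(.,z)^T z, i.e. the Q-function of reward phi^T z.

    psi/psi_target: callables (s, a, z) -> R^d (psi_target is a frozen copy);
    phi: feature map s -> R^d; pi_z: policy (s, z) -> action. All names assumed.
    """
    s, a, s_next = batch
    q = (psi(s, a, z) * z).sum(-1)                      # psi(s_t, a_t, z)^T z
    with torch.no_grad():
        a_next = pi_z(s_next, z)
        target = (phi(s_next) * z).sum(-1) \
                 + gamma * (psi_target(s_next, a_next, z) * z).sum(-1)
    return ((q - target) ** 2).mean()
```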
▪ Learning the FB representations: the FB training loss
  ▪ The successor measure M^π satisfies a Bellman-like equation, M^π = P + γP_π M^π, as matrices in the finite case and as measures in the general case [Blier et al.]
    • For any policy π_z, the Q-function for a reward r can be written Q_r^{π_z} = M^{π_z} r in matrix form
      – This equals F_z^T B diag(ρ) r; thus, setting z_r ≔ B diag(ρ) r = 𝔼_{s~ρ}[B(s) r(s)], the Q-function is obtained as Q_r^{π_z} = F_z^T z_r for any z ∈ ℝ^d
  ▪ FB can be learned by iteratively minimizing the Bellman residual on the parametric model M = F^T Bρ (see the sketch below)
    • Using a suitable norm ‖·‖_ρ for the Bellman residual leads to a loss expressed as an expectation over the dataset
    • ℒ(F, B) ≔ ‖F_z^T Bρ − (P + γP_{π_z} F_z^T Bρ)‖_ρ^2
      = 𝔼_{(s_t,a_t,s_{t+1})~ρ, s′~ρ}[(F(s_t, a_t, z)^T B(s′) − γF(s_{t+1}, π_z(s_{t+1}), z)^T B(s′))^2] − 2𝔼_{(s_t,a_t,s_{t+1})~ρ}[F(s_t, a_t, z)^T B(s_{t+1})] + Const
      – z is sampled at random during training
  ▪ The authors note that the last term involves B(s_{t+1}) instead of B(s_t), because they use s_{t+1} rather than s_t in the definition of the successor measure
Algorithms for SF and FB representations (II)
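A sketch of the FB training loss above, under the assumption that F, B and the policy π_z are available as callables and that an independent batch of states s′ ~ ρ is sampled from the buffer; the frozen target copies and all names are this sketch's assumptions rather than the slide's exact recipe.

```python
import torch

def fb_loss(F, B, F_target, B_target, pi_z, batch, z, gamma=0.98):
    """Bellman-residual loss for the forward-backward model M ~ F^T B rho.

    F/F_target: (s, a, z) -> R^d; B/B_target: s -> R^d; pi_z: (s, z) -> action.
    """
    s, a, s_next, s_prime = batch                        # s' ~ rho, independent of the transition
    with torch.no_grad():
        a_next = pi_z(s_next, z)
        next_M = F_target(s_next, a_next, z) @ B_target(s_prime).T   # (n, m)
    M = F(s, a, z) @ B(s_prime).T                                     # F^T B on all (transition, s') pairs
    residual = (M - gamma * next_M).pow(2).mean()
    # the "-2 E[F(s_t, a_t, z)^T B(s_{t+1})]" term puts mass on observed transitions
    diag = 2.0 * (F(s, a, z) * B(s_next)).sum(-1).mean()
    return residual - diag
```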
▪ Learning basic features φ for SF
  ▪ Any representation learning method could be used to supply φ
    • The authors suggest ten basic features and describe the precise learning objective for each
  ▪ 1. Random features (Rand)
    • Use a non-trainable, randomly initialized network as features
  ▪ 2. Autoencoder (AEnc)
    • Learn a decoder f: ℝ^d → S to recover the state from its representation φ
      – min_{f,φ} 𝔼_{s~𝒟}[‖f(φ(s)) − s‖^2]
  ▪ 3. Inverse Curiosity Module (ICM)
    • Aims at extracting the controllable aspects of the environment (see the sketch below)
    • Train an inverse dynamics model g: ℝ^d × ℝ^d → A to predict the action used for a transition between two consecutive states
      – min_{g,φ} 𝔼_{(s_t,a_t,s_{t+1})~𝒟}[‖g(φ(s_t), φ(s_{t+1})) − a_t‖^2]
Algorithms for SF and FB representations (III)
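As an example of one of these feature-learning objectives, here is a minimal ICM-style inverse-dynamics module for continuous actions; the architecture and layer sizes are arbitrary assumptions of this sketch.

```python
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """Minimal ICM-style objective: predict a_t from phi(s_t) and phi(s_{t+1})."""
    def __init__(self, s_dim, a_dim, d=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(s_dim, 256), nn.ReLU(), nn.Linear(256, d))
        self.g = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, a_dim))

    def loss(self, s_t, a_t, s_next):
        # gradients through g shape phi toward the controllable part of the state
        pred = self.g(torch.cat([self.phi(s_t), self.phi(s_next)], dim=-1))
        return ((pred - a_t) ** 2).mean()
```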
▪ Learning basic features φ for SF
  ▪ 4. Transition model (Trans)
    • Learn the one-step dynamics f: ℝ^d × A → S that predicts the next state from the current state representation
      – min_{f,φ} 𝔼_{(s_t,a_t,s_{t+1})~𝒟}[‖f(φ(s_t), a_t) − s_{t+1}‖^2]
  ▪ 5. Latent transition model (Latent)
    • Learn a latent dynamics model that, instead of predicting the next state, predicts its representation
      – min_{f,φ} 𝔼_{(s_t,a_t,s_{t+1})~𝒟}[‖f(φ(s_t), a_t) − φ(s_{t+1})‖^2]
  ▪ 6. Laplacian eigenfunctions (Lap)
    • Wu et al. consider the symmetrized MDP graph Laplacian induced by an exploratory policy π, defined as ℒ = I − ½(P_π diag(ρ)^{−1} + diag(ρ)^{−1} P_π^T)
    • They propose to learn the eigenfunctions of ℒ via the spectral graph drawing objective (see the sketch below):
      – min_φ 𝔼_{(s_t,s_{t+1})~𝒟}[‖φ(s_t) − φ(s_{t+1})‖^2] + λ𝔼_{s~𝒟,s′~𝒟}[(φ(s)^T φ(s′))^2 − ‖φ(s)‖_2^2 − ‖φ(s′)‖_2^2]
      – where the second term is an orthonormality regularization ensuring 𝔼_{s~ρ}[φ(s)φ(s)^T] ≈ I
Algorithms for SF and FB representations (IV)
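A sketch of the Laplacian eigenfunction (Lap) objective above, with the attraction term over consecutive states and the orthonormality penalty over independently sampled states; the weight lam and all names are assumptions of this sketch.

```python
import torch

def laplacian_features_loss(phi, s_t, s_next, s_rand, s_rand2, lam=1.0):
    """Spectral graph drawing objective for Laplacian eigenfunction features.

    phi: state -> R^d network; s_t, s_next are consecutive states from the buffer;
    s_rand, s_rand2 are independently sampled states.
    """
    # attract consecutive states in feature space
    attract = (phi(s_t) - phi(s_next)).pow(2).sum(-1).mean()
    # orthonormality penalty pushing E[phi phi^T] toward the identity
    f, f2 = phi(s_rand), phi(s_rand2)
    ortho = ((f * f2).sum(-1).pow(2) - f.pow(2).sum(-1) - f2.pow(2).sum(-1)).mean()
    return attract + lam * ortho
```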
▪ Learning basic features φ for SF
  ▪ 7. Low-rank approximation of P
    • Learn the features by estimating a low-rank model of the transition probability densities: P(ds′|s, a) ≈ 𝒳(s, a)^T μ(s′) ρ(ds′)
    • The corresponding loss on ‖𝒳^T μ − P/ρ‖ can be expressed as
      – min_{𝒳,μ} 𝔼_{(s_t,a_t)~ρ, s′~ρ}[(𝒳(s_t, a_t)^T μ(s′) − P(ds′|s_t, a_t)/ρ(ds′))^2] = 𝔼_{(s_t,a_t)~ρ, s′~ρ}[(𝒳(s_t, a_t)^T μ(s′))^2] − 2𝔼_{(s_t,a_t,s_{t+1})~ρ}[𝒳(s_t, a_t)^T μ(s_{t+1})] + C
      – This loss is a special case of the FB loss obtained by setting γ = 0 and omitting z
  ▪ 8. Contrastive learning (CL)
    • Learn the representations by pushing positive pairs closer together while keeping negative pairs apart (see the sketch below)
      – Here, two states are considered similar if they lie close on the same trajectory
    • A SimCLR-like objective is used:
      – min_{𝒳,φ} −𝔼_{k~Geom(1−γ_CL), (s_t,s_{t+k})~𝒟}[log(exp(cosine(𝒳(s_t), φ(s_{t+k}))) / 𝔼_{s′~𝒟}[exp(cosine(𝒳(s_t), φ(s′)))])], where cosine(u, v) = u^T v / (‖u‖_2 ‖v‖_2)
  ▪ 9. Low-rank approximation of SR
    • Learn the features by estimating a low-rank model of the successor measure of the exploration policy
      – min_{𝒳,φ} 𝔼_{(s_t,s_{t+1})~𝒟, s′~𝒟}[(𝒳(s_t)^T φ(s′) − γ𝒳_tgt(s_{t+1})^T φ_tgt(s′))^2] − 2𝔼_{(s_t,s_{t+1})~𝒟}[𝒳(s_t)^T φ(s_{t+1})], where 𝒳_tgt and φ_tgt are target (frozen) versions of 𝒳 and φ
Algorithms for SF and FB representations (V)
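A sketch of the contrastive (SimCLR-like) objective above, written in the standard InfoNCE form where the positive pair also appears in the denominator, which differs slightly from the slide's expectation over negatives only; sampling the k-step positive pairs (k ~ Geom(1 − γ_CL)) is assumed to happen outside this function, and all names are assumptions.

```python
import torch
import torch.nn.functional as F_nn

def simclr_like_loss(chi, phi, s_t, s_tk, s_neg):
    """Contrastive objective over trajectory pairs (illustrative sketch).

    s_t, s_tk: anchors and positives sampled k steps apart on the same trajectory;
    s_neg: negative states drawn independently from the buffer.
    """
    q = F_nn.normalize(chi(s_t), dim=-1)            # (n, d)
    pos = F_nn.normalize(phi(s_tk), dim=-1)         # (n, d)
    neg = F_nn.normalize(phi(s_neg), dim=-1)        # (m, d)
    pos_logit = (q * pos).sum(-1, keepdim=True)     # cosine similarity with the positive
    neg_logits = q @ neg.T                          # cosine similarities with negatives
    logits = torch.cat([pos_logit, neg_logits], dim=-1)
    labels = torch.zeros(len(q), dtype=torch.long, device=q.device)  # positive at index 0
    return F_nn.cross_entropy(logits, labels)
```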
Experiments
Environments
▪ All methods were tested on the DeepMind Control Suite (ExORL benchmark)
  ▪ Tasks and environments
    • Point-mass Maze
      – State, action: 4 / 2 dim vectors
    • Walker: a planar walker
      – State, action: 24 / 6 dim vectors
    • Cheetah: a running planar biped
      – State, action: 17 / 6 dim vectors
    • Quadruped: a four-legged ant navigating in 3D space
      – State, action: 78 / 12 dim vectors
  ▪ Replay buffers
    • RND: 10M training transitions collected with RND
    • APS: 10M training transitions collected with APS
    • Proto: 10M training transitions collected with ProtoRL
Point-mass Maze / Walker
Cheetah / Quadruped
▪ Comparison of 11 methods (FB and 10 SF-based models) with offline/online TD3 baselines
  ▪ Performance of each method is reported for each task in each environment, averaged over the three buffers and 10 seeds
  ▪ Control group
    • Online TD3: with task reward and free environment interactions
    • Offline TD3: with task reward, trained from the replay buffer
  ▪ FB and Lap outperform the other methods
    • FB and Lap reach about 83% and 76% of supervised offline TD3 performance, respectively
Results – Zero-shot performance of proposed methods
Average scores over tasks for each env
Average plots of zero-shot results (task, env, buffer, seeds)
Thank you!
Q&A
Editor's Notes
  1. Now I will begin presenting the algorithms of the paper I selected.
  2. In this chapter, as background knowledge, I will explain on-policy and off-policy algorithms. On-policy algorithms are those in which the behavioral policy π_b that collects samples and the target policy π being improved are identical or similar. These algorithms have the advantage of stable policy learning, but relatively low sample efficiency. In contrast, off-policy algorithms allow the behavior policy π_b and the target policy π being updated to be independent; because they use a replay buffer, they offer high sample efficiency, along with the drawback of relatively unstable learning.
  3. In this chapter, as background knowledge, I will explain on-policy and off-policy algorithms. On-policy algorithms are those in which the behavioral policy π_b that collects samples and the target policy π being improved are identical or similar. These algorithms have the advantage of stable policy learning, but relatively low sample efficiency. In contrast, off-policy algorithms allow the behavior policy π_b and the target policy π being updated to be independent; because they use a replay buffer, they offer high sample efficiency, along with the drawback of relatively unstable learning.
  4. Now I will begin presenting the algorithms of the paper I selected.
  5. That concludes my presentation on gSDE. Thank you for listening!
  6. If you have any questions, please feel free to ask.