Does Zero-Shot Reinforcement Learning Exist?
백승언, 김현성, 이도현, 정강민
11 June, 2023
 Introduction
 Current Success of AI
 Meta Reinforcement Learning
 Does Zero-Shot Reinforcement Learning Exist?
 Backgrounds
 Previous Strategies for Zero-Shot RL
 Algorithms for SF and FB Representations
 Experiments
 Environments
 Results
Contents
Introduction
 Problem setting of reinforcement learning and meta-learning
 Reinforcement learning
• Given a certain MDP, learn a policy $\pi$ that maximizes the expected discounted return $\mathbb{E}_{\pi, p_0}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t, s_{t+1})\right]$
 Meta-learning
• Given data from tasks $\mathcal{T}_1, \ldots, \mathcal{T}_N$, quickly solve a new task $\mathcal{T}_{\text{test}}$
 Problem setting of Meta-Reinforcement Learning (Meta-RL)
 Setting 1: meta-learning with diverse goals (goal as a task)
• $\mathcal{T}_i \triangleq \{\mathcal{S}, \mathcal{A}, p(s_0), p(s' \mid s, a), r(s, a, g), g_i\}$
 Setting 2: meta-learning with RL tasks (MDP as a task)
• $\mathcal{T}_i \triangleq \{\mathcal{S}_i, \mathcal{A}_i, p_i(s_0), p_i(s' \mid s, a), r_i(s, a)\}$
Meta Reinforcement Learning
Meta-RL problem statement in CS-330 (Finn)
Does Zero-Shot Reinforcement Learning Exist?
 Notation
 Reward-free MDP
• $\mathcal{M} = (S, A, P, \gamma)$ is a reward-free Markov Decision Process (MDP) with state space $S$, action space $A$, transition probability $P(s' \mid s, a)$ from state $s$ to $s'$ given action $a$, and discount factor $0 < \gamma < 1$
 Problem statement
 The goal of zero-shot RL is to compute a compact representation $\mathcal{E}$ of the environment by observing samples of reward-free transitions $(s_t, a_t, s_{t+1})$ in this environment
 Once a reward function is specified later, the agent must use $\mathcal{E}$ to immediately produce a good policy, via only elementary computation, without any further planning or learning (a minimal interface sketch follows below)
 Reward functions may be specified at test time either as a relatively small set of reward samples $(s_i, r_i)$ or as an explicit function $s \mapsto r(s)$
Backgrounds (I) – Defining Zero-Shot RL
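To make the two-phase protocol concrete, the following is a minimal Python interface sketch. It is not from the paper; the class and method names are illustrative placeholders. Every method discussed in this talk fits this shape: an unsupervised phase on reward-free transitions, then a cheap test-time mapping from reward samples to a policy.

```python
# Minimal sketch of the zero-shot RL protocol; names are illustrative placeholders,
# not the paper's API.
from typing import Sequence, Tuple
import numpy as np


class ZeroShotAgent:
    def train_reward_free(self, transitions: Sequence[Tuple[np.ndarray, np.ndarray, np.ndarray]]) -> None:
        """Unsupervised phase: learn a compact representation E from (s_t, a_t, s_{t+1}) samples only."""
        raise NotImplementedError

    def infer_task(self, reward_samples: Sequence[Tuple[np.ndarray, float]]) -> np.ndarray:
        """Test time: map a few (s_i, r_i) pairs to a task vector z, with no further learning or planning."""
        raise NotImplementedError

    def act(self, state: np.ndarray, z: np.ndarray) -> np.ndarray:
        """Elementary computation only, e.g. an argmax over actions of a learned Q-like quantity for task z."""
        raise NotImplementedError
```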
 Successor representations (SR)
 For a finite MDP, the successor representation $M^\pi(s_0, a_0)$ of a state-action pair $(s_0, a_0)$ under a policy $\pi$ is defined as the discounted sum of future occurrences of each state
• $M^\pi(s_0, a_0, s) := \mathbb{E}\left[\sum_{t \ge 0} \gamma^t \, \mathbb{1}\{s_{t+1} = s\} \mid s_0, a_0, \pi\right], \ \forall s \in S$
 In matrix form, SRs can be written as $M^\pi = P \sum_{t \ge 0} \gamma^t P_\pi^t = P (I - \gamma P_\pi)^{-1}$, where $P_\pi$ is the state transition matrix under $\pi$ (a small numerical example follows at the end of this slide)
• $M^\pi$ satisfies the matrix Bellman equation $M^\pi = P + \gamma P_\pi M^\pi$, and the Q-function can be expressed as $Q_r^\pi = M^\pi r$
 Successor features (SFs)
 Successor features extend SRs to continuous MDPs by first assuming a basic feature map $\varphi: S \to \mathbb{R}^d$ that embeds states into a $d$-dimensional space, and then defining the expected discounted sum of future state features
• $\psi^\pi(s_0, a_0) := \mathbb{E}\left[\sum_{t \ge 0} \gamma^t \varphi(s_{t+1}) \mid s_0, a_0, \pi\right]$
 Successor measures (SMs)
 Successor measures extend SRs to continuous spaces by treating the distribution of future visited states as a measure $M^\pi$ over the state space $S$
• $M^\pi(s_0, a_0)(X) := \sum_{t \ge 0} \gamma^t \Pr(s_{t+1} \in X \mid s_0, a_0, \pi), \ \forall X \subset S$, so that $\psi^\pi(s_0, a_0) = \int_{s'} M^\pi(s_0, a_0, ds') \, \varphi(s')$
Backgrounds (II) – Important Concept
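As a quick sanity check of the matrix identities above, the following numpy example (illustrative only; the MDP, policy, and reward are random) computes $M^\pi = P(I - \gamma P_\pi)^{-1}$ and the resulting $Q_r^\pi = M^\pi r$ for a small finite MDP.

```python
# Tiny numpy example of the SR matrix identities: M^pi = P (I - gamma P_pi)^{-1}
# and Q_r^pi = M^pi r, for a randomly generated finite MDP (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

P = rng.dirichlet(np.ones(nS), size=nS * nA)      # P[(s, a), s'] transition matrix
pi = rng.dirichlet(np.ones(nA), size=nS)          # a stochastic policy pi(a | s)

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) P[(s,a), s']
P_pi = np.einsum('sa,saj->sj', pi, P.reshape(nS, nA, nS))

M = P @ np.linalg.inv(np.eye(nS) - gamma * P_pi)  # successor representation, (s,a) x s'

r = rng.normal(size=nS)                           # an arbitrary reward on (next) states
Q = M @ r                                         # Q_r^pi(s, a) = sum_{s'} M^pi(s, a, s') r(s')
print(Q.reshape(nS, nA))
```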
 Zero-shot RL from successor features (SFs)
 Given a basic feature map $\varphi: S \to \mathbb{R}^d$ to be learned via another criterion, universal SFs learn the successor features of a particular family of policies $\pi_z$ for $z \in \mathbb{R}^d$
• $\psi(s_0, a_0, z) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t \varphi(s_{t+1}) \mid s_0, a_0, \pi_z\right]$, with $\pi_z(s) := \operatorname{argmax}_a \psi(s, a, z)^T z$
 Once a reward function $r$ is revealed, a few reward samples or explicit knowledge of the function $r$ are used to perform a linear regression of $r$ onto the features $\varphi$
• Namely, $z_r := \operatorname{argmin}_z \mathbb{E}_{s \sim \rho}\left[(r(s) - \varphi(s)^T z)^2\right] = \mathbb{E}_\rho[\varphi \varphi^T]^{-1} \mathbb{E}_\rho[\varphi r]$; then the policy $\pi_{z_r}$ is returned (see the sketch after this slide)
• This policy is guaranteed to be optimal for all rewards in the linear span of the features $\varphi$
– If $r(s) = \varphi(s)^T w, \ \forall s \in S$, then $z_r = w$, and $\pi_{z_r}$ is the optimal policy for reward $r$
 Zero-shot RL from forward-backward (FB) representations
 Forward-backward representations look for $F: S \times A \times \mathbb{R}^d \to \mathbb{R}^d$ and $B: S \to \mathbb{R}^d$ such that the long-term transition probabilities $M^{\pi_z}$ decompose as
• $M^{\pi_z}(s_0, a_0, ds') \approx F(s_0, a_0, z)^T B(s') \, \rho(ds')$, with $\pi_z(s) := \operatorname{argmax}_a F(s, a, z)^T z$
• In a finite space, $M^{\pi_z}$ can be decomposed as $M^{\pi_z} = F_z^T B \, \mathrm{diag}(\rho)$
 Once a reward function $r$ is revealed, $z_r := \mathbb{E}_{s \sim \rho}[r(s) B(s)]$ is estimated from a few reward samples or from explicit knowledge of the function $r$ (e.g., $z_r = B(s)$ for the task of reaching $s$)
• Then the policy $\pi_{z_r}$ is returned
• For any reward function $r$, the policy $\pi_{z_r}$ is optimal for $r$, with optimal Q-function $Q_r^\star = F(s, a, z_r)^T z_r$
Previous Strategies for Zero-Shot RL
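The test-time step of both families reduces to a few lines of numpy, as sketched below. This is illustrative, not the authors' code; `phi` and `B` stand in for the learned features evaluated at the reward-sample states, and random data replaces a real dataset.

```python
# Illustrative sketch of test-time task inference: map reward samples (s_i, r_i)
# to a task vector z_r for SFs (linear regression) and for FB (simple average).
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 512                      # feature dimension, number of reward samples

phi = rng.normal(size=(n, d))      # stand-in for phi(s_i), i = 1..n
B = rng.normal(size=(n, d))        # stand-in for B(s_i) in the FB case
r = rng.normal(size=n)             # reward samples r(s_i)

# SFs: z_r = E[phi phi^T]^{-1} E[phi r]  (small ridge term for numerical stability)
A = phi.T @ phi / n + 1e-6 * np.eye(d)
z_sf = np.linalg.solve(A, phi.T @ r / n)

# FB: z_r = E_{s ~ rho}[ r(s) B(s) ]
z_fb = (r[:, None] * B).mean(axis=0)

# Either z is then plugged into the corresponding policy pi_z
# (argmax over actions of psi(s, a, z)^T z or F(s, a, z)^T z); no further learning.
print(z_sf.shape, z_fb.shape)
```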
 The authors suggest novel losses used to train $\psi$ in SFs, and $F, B$ in FB
 To obtain a full zero-shot RL algorithm, SFs must also specify the basic features $\varphi$, so they propose ten possible choices based on existing or new representations for RL
 Learning the SF $\psi^T z$ instead of $\psi$
 The successor feature $\psi$ satisfies the Bellman equation $\psi^\pi = P\varphi + \gamma P_\pi \psi^\pi$, a collection of ordinary Bellman equations, one for each component of $\varphi$, as in BBQ-networks
 Therefore $\psi(s, a, z)$ for each $z$ could be trained by minimizing the vector-valued Bellman residual
• $\left\| \psi(s_t, a_t, z) - \varphi(s_{t+1}) - \gamma \psi(s_{t+1}, \pi_z(s_{t+1}), z) \right\|^2$, where $z$ is randomly sampled according to some sampling logic
 Instead of the vector-valued Bellman residual above, they propose a novel scalar loss (a PyTorch sketch follows below)
• $\mathcal{L}(\psi) := \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\left[\left(\psi(s_t, a_t, z)^T z - \varphi(s_{t+1})^T z - \gamma \psi(s_{t+1}, \pi_z(s_{t+1}), z)^T z\right)^2\right]$ for each $z$
 This trains $\psi(\cdot, z)^T z$ as the Q-function of the reward $\varphi^T z$, the only case needed, whereas training the full vector $\psi(\cdot, z)$ amounts to training the Q-functions of each policy $\pi_z$ for all rewards $\varphi^T z'$, for all $z' \in \mathbb{R}^d$, including $z' \ne z$
Algorithms for SF and FB Representations (I)
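Below is a minimal PyTorch sketch of this scalar loss (illustrative, not the authors' implementation). `psi_net`, `psi_target`, and `phi` are assumed networks, `a_next` is assumed to be $\pi_z(s_{t+1})$, and a frozen target copy of $\psi$ is used for the bootstrapped term.

```python
# Illustrative PyTorch sketch of the scalar SF loss L(psi) above.
import torch
import torch.nn.functional as F


def sf_scalar_loss(psi_net, psi_target, phi, s, a, s_next, a_next, z, gamma=0.98):
    # Q estimate for the reward phi^T z: psi(s_t, a_t, z)^T z
    q = (psi_net(s, a, z) * z).sum(-1)
    with torch.no_grad():
        r_hat = (phi(s_next) * z).sum(-1)                     # phi(s_{t+1})^T z
        q_next = (psi_target(s_next, a_next, z) * z).sum(-1)  # psi(s_{t+1}, pi_z(s_{t+1}), z)^T z
        target = r_hat + gamma * q_next
    return F.mse_loss(q, target)
```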
 Learning the FB representations: the FB training loss
 The successor measure $M^\pi$ satisfies a Bellman-like equation, $M^\pi = P + \gamma P_\pi M^\pi$, as matrices in the finite case and as measures in the general case [Blier et al.]
• For any policy $\pi_z$, the Q-function for the reward $r$ can be written $Q_r^{\pi_z} = M^{\pi_z} r$ in matrix form
– This is equal to $F_z^T B \, \mathrm{diag}(\rho) r$; thus, setting $z_r := B \, \mathrm{diag}(\rho) r = \mathbb{E}_{s \sim \rho}[B(s) r(s)]$, the Q-function is obtained as $Q_r^{\pi_z} = F_z^T z_r$ for any $z \in \mathbb{R}^d$
 FB can be learned by iteratively minimizing the Bellman residual of the parametric model $M = F^T B \rho$ (a PyTorch sketch follows below)
• Using a suitable norm $\|\cdot\|_\rho$ for the Bellman residual leads to a loss expressed as an expectation over the dataset
• $\mathcal{L}(F, B) := \left\| F_z^T B \rho - (P + \gamma P_{\pi_z} F_z^T B \rho) \right\|_\rho^2$
$= \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho,\, s' \sim \rho}\left[\left(F(s_t, a_t, z)^T B(s') - \gamma F(s_{t+1}, \pi_z(s_{t+1}), z)^T B(s')\right)^2\right] - 2\, \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\left[F(s_t, a_t, z)^T B(s_{t+1})\right] + \text{const}$
– where $z$ is randomly sampled according to some sampling logic
 The authors note that the last term involves $B(s_{t+1})$ instead of $B(s_t)$, because they use $s_{t+1}$ instead of $s_t$ in the definition of the successor measure
Algorithms for SF and FB representations (II)
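Below is an illustrative PyTorch sketch of this FB loss (not the authors' implementation). `f_net(s, a, z)` and `b_net(s)` map to $\mathbb{R}^d$; `f_target` and `b_target` are frozen copies used inside the bootstrapped term, `a_next` is assumed to be $\pi_z(s_{t+1})$, and `s_rand` is an independent batch of states playing the role of $s' \sim \rho$.

```python
# Illustrative PyTorch sketch of the FB training loss L(F, B) above.
import torch


def fb_loss(f_net, b_net, f_target, b_target, s, a, s_next, a_next, s_rand, z, gamma=0.98):
    F_sa = f_net(s, a, z)                      # F(s_t, a_t, z), shape (batch, d)
    B_rand = b_net(s_rand)                     # B(s') for s' ~ rho, shape (batch, d)
    with torch.no_grad():
        F_next = f_target(s_next, a_next, z)   # F(s_{t+1}, pi_z(s_{t+1}), z)
        B_rand_tgt = b_target(s_rand)
    M = F_sa @ B_rand.T                        # F(s_t, a_t, z)^T B(s'), batch x batch
    M_next = F_next @ B_rand_tgt.T
    residual = ((M - gamma * M_next) ** 2).mean()           # squared Bellman-residual term
    attract = -2.0 * (F_sa * b_net(s_next)).sum(-1).mean()  # -2 F(s_t, a_t, z)^T B(s_{t+1})
    return residual + attract
```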
 Learning basic features $\varphi$ for SF
 Any representation learning method could be used to supply $\varphi$
• The authors suggest ten basic features and describe the precise learning objective for each (an ICM sketch follows below)
 1. Random Features (Rand)
• Using a non-trainable, randomly initialized network as features
 2. Autoencoder (AEnc)
• Learning a decoder $f: \mathbb{R}^d \to S$ to recover the state from its representation $\varphi$
– $\min_{f, \varphi} \mathbb{E}_{s \sim \mathcal{D}}\left[\| f(\varphi(s)) - s \|^2\right]$
 3. Inverse Curiosity Module (ICM)
• Aims at extracting the controllable aspects of the environment
• Trains an inverse dynamics model $g: \mathbb{R}^d \times \mathbb{R}^d \to A$ to predict the action used for a transition between two consecutive states
– $\min_{g, \varphi} \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\left[\| g(\varphi(s_t), \varphi(s_{t+1})) - a_t \|^2\right]$
Algorithms for SF and FB representations (III)
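As an example of these feature objectives, here is an illustrative PyTorch sketch of the ICM inverse-dynamics loss (item 3 above); the network sizes and the continuous-action assumption are placeholders, not the paper's architecture.

```python
# Illustrative PyTorch sketch of the ICM-style inverse-dynamics feature objective.
import torch
import torch.nn as nn


class InverseDynamicsFeatures(nn.Module):
    def __init__(self, state_dim, action_dim, d=64):
        super().__init__()
        # phi: S -> R^d (the basic features), g: R^d x R^d -> A (inverse dynamics model)
        self.phi = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, d))
        self.g = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, action_dim))

    def loss(self, s, a, s_next):
        z_t, z_tp1 = self.phi(s), self.phi(s_next)
        a_pred = self.g(torch.cat([z_t, z_tp1], dim=-1))  # predict a_t from (phi(s_t), phi(s_{t+1}))
        return ((a_pred - a) ** 2).mean()
```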
 Learning basic features $\varphi$ for SF
 4. Transition model (Trans)
• Learning a one-step dynamics model $f: \mathbb{R}^d \times A \to S$ that predicts the next state from the current state representation
– $\min_{f, \varphi} \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\left[\| f(\varphi(s_t), a_t) - s_{t+1} \|^2\right]$
 5. Latent transition model (Latent)
• Learning a latent dynamics model that, instead of predicting the next state, predicts its representation
– $\min_{f, \varphi} \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\left[\| f(\varphi(s_t), a_t) - \varphi(s_{t+1}) \|^2\right]$
 6. Laplacian Eigenfunctions (Lap)
• Wu et al. consider the symmetrized MDP graph Laplacian induced by an exploratory policy $\pi$, defined as $\mathcal{L} = I - \frac{1}{2}\left(P_\pi \, \mathrm{diag}(\rho)^{-1} + \mathrm{diag}(\rho)^{-1} P_\pi^T\right)$
• They propose to learn the eigenfunctions of $\mathcal{L}$ via the spectral graph drawing objective (a sketch follows below):
– $\min_{\varphi} \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}}\left[\| \varphi(s_t) - \varphi(s_{t+1}) \|^2\right] + \lambda\, \mathbb{E}_{s \sim \mathcal{D},\, s' \sim \mathcal{D}}\left[(\varphi(s)^T \varphi(s'))^2 - \| \varphi(s) \|_2^2 - \| \varphi(s') \|_2^2\right]$
– where the second term is an orthonormality regularization ensuring that $\mathbb{E}_{s \sim \rho}[\varphi(s) \varphi(s)^T] \approx I$
Algorithms for SF and FB representations (IV)
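Below is an illustrative PyTorch sketch of the spectral objective for item 6; `phi` is an assumed encoder, `lam` is the orthonormality weight, and `s_rand1`, `s_rand2` are independent batches playing the role of $s, s' \sim \mathcal{D}$.

```python
# Illustrative PyTorch sketch of the Laplacian-eigenfunction (spectral graph drawing) loss.
import torch


def laplacian_loss(phi, s, s_next, s_rand1, s_rand2, lam=1.0):
    # Smoothness term: || phi(s_t) - phi(s_{t+1}) ||^2 over consecutive states
    smooth = ((phi(s) - phi(s_next)) ** 2).sum(-1).mean()
    # Orthonormality regularizer: (phi(s)^T phi(s'))^2 - ||phi(s)||^2 - ||phi(s')||^2
    p, q = phi(s_rand1), phi(s_rand2)
    ortho = ((p * q).sum(-1) ** 2 - (p ** 2).sum(-1) - (q ** 2).sum(-1)).mean()
    return smooth + lam * ortho
```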
 Learning basic features $\varphi$ for SF
 7. Low-Rank Approximation of P
• Learning the features by estimating a low-rank model of the transition probability densities: $P(ds' \mid s, a) \approx \mathcal{X}(s, a)^T \mu(s') \, \rho(ds')$
• The corresponding loss on $\mathcal{X}^T \mu - P/\rho$ can be expressed as
– $\min_{\mathcal{X}, \mu} \mathbb{E}_{(s_t, a_t) \sim \rho,\, s' \sim \rho}\left[\left(\mathcal{X}(s_t, a_t)^T \mu(s') - \frac{P(ds' \mid s_t, a_t)}{\rho(ds')}\right)^2\right] = \mathbb{E}_{(s_t, a_t) \sim \rho,\, s' \sim \rho}\left[\left(\mathcal{X}(s_t, a_t)^T \mu(s')\right)^2\right] - 2\, \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\left[\mathcal{X}(s_t, a_t)^T \mu(s_{t+1})\right] + C$
– This loss is also a special case of the FB loss, obtained by setting $\gamma = 0$ and omitting $z$
 8. Contrastive Learning
• Learning the representations by pushing positive pairs closer together while keeping negative pairs apart (a sketch follows below)
– Here, two states are considered similar if they lie close on the same trajectory
• They propose a SimCLR-like objective:
– $\min_{\mathcal{X}, \varphi} -\mathbb{E}_{k \sim \mathrm{Geom}(1 - \gamma_{CL}),\, (s_t, s_{t+k}) \sim \mathcal{D}}\left[\log \frac{\exp(\mathrm{cosine}(\mathcal{X}(s_t), \varphi(s_{t+k})))}{\mathbb{E}_{s' \sim \mathcal{D}}\left[\exp(\mathrm{cosine}(\mathcal{X}(s_t), \varphi(s')))\right]}\right]$, where $\mathrm{cosine}(u, v) = \frac{u^T v}{\|u\|_2 \|v\|_2}$
 9. Low-Rank Approximation of SR
• Learning the features by estimating a low-rank model of the successor measure of the exploration policy
– $\min_{\mathcal{X}, \varphi} \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D},\, s' \sim \mathcal{D}}\left[\left(\mathcal{X}(s_t)^T \varphi(s') - \gamma\, \bar{\mathcal{X}}(s_{t+1})^T \bar{\varphi}(s')\right)^2\right] - 2\, \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}}\left[\mathcal{X}(s_t)^T \varphi(s_{t+1})\right]$, where $\bar{\mathcal{X}}$ and $\bar{\varphi}$ are target versions of $\mathcal{X}$ and $\varphi$
Algorithms for SF and FB representations (V)
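Below is an illustrative PyTorch sketch of the SimCLR-like objective for item 8, written in the usual InfoNCE form where the other samples in the batch act as negatives (this approximates the expectation in the denominator up to a constant); `chi` and `phi` are assumed encoders.

```python
# Illustrative PyTorch sketch of the contrastive (SimCLR-like) feature objective.
import torch
import torch.nn.functional as F


def contrastive_loss(chi, phi, s_t, s_tk):
    # s_tk[i] is a state k steps after s_t[i] on the same trajectory, k ~ Geom(1 - gamma_CL)
    u = F.normalize(chi(s_t), dim=-1)
    v = F.normalize(phi(s_tk), dim=-1)
    logits = u @ v.T                                    # pairwise cosine similarities, batch x batch
    labels = torch.arange(u.shape[0], device=u.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)              # -log softmax over in-batch negatives
```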
Experiments
Environments
 All the methods were tested on the DeepMind Control Suite (ExORL benchmarks)
 Tasks and environments
• Point-mass Maze
– State, action: 4/2-dim vectors
• Walker: a planar walker
– State, action: 24/6-dim vectors
• Cheetah: a running planar biped
– State, action: 17/6-dim vectors
• Quadruped: a four-legged ant navigating in 3D space
– State, action: 78/12-dim vectors
 Replay buffers
• RND: 10M training transitions collected with RND
• APS: 10M training transitions collected with APS
• Proto: 10M training transitions collected with ProtoRL
Point-mass Maze / Walker
Cheetah / Quadruped
 Comparison of 11 methods (FB and 10 SF-based models) against offline/online TD3
 The performance of each method is reported for each task in each environment, averaged over the three buffers and 10 seeds
 Control group
• Online TD3: with task reward and free environment interactions
• Offline TD3: with task reward, trained from the replay buffer
 FB and Lap show superior performance to the other methods
• FB and Lap reach about 83% and 76% of the supervised offline TD3 performance, respectively
Results – Zero-shot performance of proposed methods
Average scores over tasks for each env
Average plots of zero-shot results (task, env, buffer, seeds)
Thank you!
Q&A