This document discusses zero-shot reinforcement learning. It defines zero-shot RL as using a compact representation of a reward-free environment to immediately produce a good policy when a reward function is specified, without further learning. It describes using successor features and forward-backward representations for this purpose. Algorithms are proposed to learn the successor features and forward-backward representations by minimizing Bellman residuals. Various methods for learning the basic feature representation used in successor features are also discussed.
2. Contents
• Introduction
• Current Success of AI
• Meta Reinforcement Learning
• Does Zero-Shot Reinforcement Learning Exist?
• Backgrounds
• Previous Strategies for Zero-Shot RL
• Algorithms for SF and FB Representations
• Experiments
• Environments
• Results
4. Meta Reinforcement Learning
Problem settings of reinforcement learning and meta-learning
Reinforcement learning
• Given an MDP, learn a policy $\pi$ that maximizes the expected discounted return $\mathbb{E}_{\pi, p_0}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t(s_t, a_t, s_{t+1})\right]$
Meta-learning
• Given data from tasks $\mathcal{T}_1, \ldots, \mathcal{T}_N$, quickly solve a new task $\mathcal{T}_{\mathrm{test}}$
Problem setting of Meta-Reinforcement Learning (Meta-RL)
Setting 1: meta-learning with diverse goals (goal as a task)
• $\mathcal{T}_i \triangleq \{\mathcal{S}, \mathcal{A}, p(s_0), p(s' \mid s, a), r(s, a, g), g_i\}$
Setting 2: meta-learning with RL tasks (MDP as a task)
• $\mathcal{T}_i \triangleq \{\mathcal{S}_i, \mathcal{A}_i, p_i(s_0), p_i(s' \mid s, a), r_i(s, a)\}$
[Figure: Meta-RL problem statement, from CS-330 (Finn)]
6. Backgrounds (I) – Defining Zero-Shot RL
Notation
Reward-free MDP
• $\mathcal{M} = (S, A, P, \gamma)$ is a reward-free Markov Decision Process (MDP) with state space $S$, action space $A$, transition probabilities $P(s' \mid s, a)$ from state $s$ to $s'$ given action $a$, and discount factor $0 < \gamma < 1$
Problem statement
The goal of zero-shot RL is to compute a compact representation $\mathcal{E}$ of the environment by observing samples of reward-free transitions $(s_t, a_t, s_{t+1})$ in this environment
Once a reward function is specified later, the agent must use $\mathcal{E}$ to immediately produce a good policy, via only elementary computation without any further planning or learning
Reward functions may be specified at test time either as a relatively small set of reward samples $(s_i, r_i)$, or as an explicit function $s \mapsto r(s)$
7. Backgrounds (II) – Important Concepts
Successor representations (SRs)
For a finite MDP, the successor representation $M^\pi(s_0, a_0)$ of a state-action pair $(s_0, a_0)$ under a policy $\pi$ is defined as the discounted sum of future occurrences of each state
• $M^\pi(s_0, a_0, s) := \mathbb{E}\left[\sum_{t \ge 0} \gamma^t \mathbb{1}\{s_{t+1} = s\} \mid s_0, a_0, \pi\right], \quad \forall s \in S$
In matrix form, SRs can be written as $M^\pi = P \sum_{t \ge 0} \gamma^t P_\pi^t = P (I - \gamma P_\pi)^{-1}$, where $P_\pi$ is the state transition matrix under $\pi$
• $M^\pi$ satisfies the matrix Bellman equation $M^\pi = P + \gamma P_\pi M^\pi$, and the Q-function can be expressed as $Q_r^\pi = M^\pi r$
Successor features (SFs)
Successor features extend SRs to continuous MDPs by first assuming a basic feature map $\varphi: S \to \mathbb{R}^d$ that embeds states into a $d$-dimensional space, and then defining the expected discounted sum of future state features
• $\psi^\pi(s_0, a_0) := \mathbb{E}\left[\sum_{t \ge 0} \gamma^t \varphi(s_{t+1}) \mid s_0, a_0, \pi\right]$
Successor measures (SMs)
Successor measures extend SRs to continuous spaces by treating the distribution of future visited states as a measure $M^\pi$ over the state space $S$
• $M^\pi(s_0, a_0)(X) := \sum_{t \ge 0} \gamma^t \Pr(s_{t+1} \in X \mid s_0, a_0, \pi), \ \forall X \subset S$, with $\psi^\pi(s_0, a_0) = \int_{s'} M^\pi(s_0, a_0, ds')\, \varphi(s')$
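To make the matrix identities above concrete, here is a minimal numpy sketch on a made-up finite MDP (all numbers are illustrative, not from the paper): it builds $M^\pi = P(I - \gamma P_\pi)^{-1}$ and checks that $Q_r^\pi = M^\pi r$ satisfies the Bellman equation once a reward is revealed.

```python
import numpy as np

# Toy reward-free MDP (all quantities are made-up illustration data): 3 states, 2 actions.
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']: transition probabilities
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi[s, a]: a fixed stochastic policy

P_flat = P.reshape(nS * nA, nS)                 # rows indexed by the pair (s, a)
P_pi = np.einsum("sa,sab->sb", pi, P)           # state-to-state transition matrix under pi

# Successor representation M^pi = P (I - gamma * P_pi)^{-1}, shape (S*A, S)
M = P_flat @ np.linalg.inv(np.eye(nS) - gamma * P_pi)

# Once a reward r(s') is revealed, the Q-function is just a matrix-vector product: Q_r^pi = M^pi r
r = rng.standard_normal(nS)
Q = (M @ r).reshape(nS, nA)

# Sanity check against the Bellman equation Q(s,a) = E[r(s') + gamma * V(s')], with V the pi-average of Q
V = np.einsum("sa,sa->s", pi, Q)
assert np.allclose(Q, (P_flat @ (r + gamma * V)).reshape(nS, nA))
```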
8. Previous Strategies for Zero-Shot RL
Zero-shot RL from successor features (SFs)
Given a basic feature map $\varphi: S \to \mathbb{R}^d$, learned via some other criterion, universal SFs learn the successor features of a particular family of policies $\pi_z$ for $z \in \mathbb{R}^d$:
• $\psi(s_0, a_0, z) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t \varphi(s_{t+1}) \mid s_0, a_0, \pi_z\right]$, with $\pi_z(s) := \arg\max_a \psi(s, a, z)^T z$
Once a reward function $r$ is revealed, a few reward samples or explicit knowledge of the function $r$ are used to linearly regress $r$ onto the features $\varphi$
• Namely, $z_r := \arg\min_z \mathbb{E}_{s \sim \rho}\left[(r(s) - \varphi(s)^T z)^2\right] = \mathbb{E}_\rho[\varphi \varphi^T]^{-1} \mathbb{E}_\rho[\varphi\, r]$; then the policy $\pi_{z_r}$ is returned
• This policy is guaranteed to be optimal for all rewards in the linear span of the features $\varphi$
  – If $r(s) = \varphi(s)^T w$ for all $s \in S$, then $z_r = w$ and $\pi_{z_r}$ is the optimal policy for reward $r$
Zero-shot RL from forward-backward (FB) representations
Forward-backward representations look for $F: S \times A \times \mathbb{R}^d \to \mathbb{R}^d$ and $B: S \to \mathbb{R}^d$ such that the long-term transition probabilities $M^{\pi_z}$ decompose as
• $M^{\pi_z}(s_0, a_0, ds') \approx F(s_0, a_0, z)^T B(s')\, \rho(ds')$, with $\pi_z(s) := \arg\max_a F(s, a, z)^T z$
• In a finite space, $M^{\pi_z}$ can be decomposed as $M^{\pi_z} = F_z^T B\, \mathrm{diag}(\rho)$
Once a reward function $r$ is revealed, $z_r := \mathbb{E}_{s \sim \rho}[r(s)\, B(s)]$ is estimated from a few reward samples or from explicit knowledge of the function $r$ (e.g. $z_r = B(s)$ for the task of reaching state $s$)
• Then the policy $\pi_{z_r}$ is returned.
• For any reward function $r$, the policy $\pi_{z_r}$ is optimal for $r$, with optimal Q-function $Q_r^\star = F(s, a, z_r)^T z_r$
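A minimal PyTorch sketch of the test-time inference step described above, assuming already-trained networks `phi`, `B`, and a head `psi` or `F` are available as callables (the names, shapes, and the finite candidate-action loop are illustrative assumptions, not the authors' code):

```python
import torch

def infer_z_sf(phi, reward_states, reward_values, ridge=1e-5):
    """SF route: linear regression of the revealed reward onto the features,
    z_r = E[phi phi^T]^{-1} E[phi r]. `phi` is an assumed, already-trained feature network."""
    with torch.no_grad():
        Phi = phi(reward_states)                                        # (n, d)
        A = Phi.T @ Phi / len(Phi) + ridge * torch.eye(Phi.shape[1])
        b = Phi.T @ reward_values / len(Phi)                            # (d,)
        return torch.linalg.solve(A, b)

def infer_z_fb(B, reward_states, reward_values):
    """FB route: z_r = E_{s~rho}[ r(s) B(s) ], estimated from reward samples."""
    with torch.no_grad():
        return (reward_values.unsqueeze(-1) * B(reward_states)).mean(dim=0)

def act_greedy(head, state, z, candidate_actions):
    """Greedy action: argmax_a psi(s, a, z)^T z (SF) or argmax_a F(s, a, z)^T z (FB),
    here over a finite set of candidate actions for simplicity."""
    with torch.no_grad():
        scores = torch.stack([head(state, a, z) @ z for a in candidate_actions])
        return candidate_actions[int(scores.argmax())]
```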
9. Algorithms for SF and FB Representations (I)
The authors propose novel losses for training $\psi$ in SFs, and $F$, $B$ in FB
To obtain a full zero-shot RL algorithm, SFs must also specify the basic features $\varphi$; the authors propose ten possible choices based on existing or new representations for RL
Learning the SF $\psi^T z$ instead of $\psi$
The successor features $\psi$ satisfy the Bellman equation $\psi^\pi = P\varphi + \gamma P_\pi \psi^\pi$, a collection of ordinary Bellman equations, one for each component of $\varphi$
Therefore, $\psi(s, a, z)$ could be trained for each $z$ by minimizing the vector-valued Bellman residual
• $\left\| \psi(s_t, a_t, z) - \varphi(s_{t+1}) - \gamma\, \psi(s_{t+1}, \pi_z(s_{t+1}), z) \right\|^2$, where $z$ is sampled at random during training
They propose the following loss instead of the vector-valued Bellman residual above:
• $\mathcal{L}(\psi) := \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\left[ \left( \psi(s_t, a_t, z)^T z - \varphi(s_{t+1})^T z - \gamma\, \psi(s_{t+1}, \pi_z(s_{t+1}), z)^T z \right)^2 \right]$ for each $z$
This trains $\psi(\cdot, z)^T z$ as the Q-function of the reward $\varphi^T z$, the only case needed, whereas training the full vector $\psi(\cdot, z)$ amounts to training the Q-functions of the policy $\pi_z$ for all rewards $\varphi^T z'$ with $z' \in \mathbb{R}^d$, including $z' \ne z$.
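A minimal PyTorch sketch of the projected loss $\mathcal{L}(\psi)$ above, assuming pre-built networks `psi`, `phi`, and a policy head `policy`; the target copy `psi_target` and the fixed discount are my own assumptions for a stable bootstrap, not details stated on the slide:

```python
import torch

def sf_loss(psi, psi_target, phi, policy, batch, z, gamma=0.98):
    """Projected SF Bellman residual:
    ( psi(s_t,a_t,z)^T z - phi(s_{t+1})^T z - gamma * psi(s_{t+1}, pi_z(s_{t+1}), z)^T z )^2."""
    s, a, s_next = batch                               # transition tensors (s_t, a_t, s_{t+1})
    q = (psi(s, a, z) * z).sum(-1)                     # psi(s_t, a_t, z)^T z
    with torch.no_grad():                              # stop gradients through the Bellman target
        a_next = policy(s_next, z)                     # pi_z(s_{t+1})
        target = (phi(s_next) * z).sum(-1) + gamma * (psi_target(s_next, a_next, z) * z).sum(-1)
    return ((q - target) ** 2).mean()
```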
10. Algorithms for SF and FB Representations (II)
Learning the FB representations: the FB training loss
The successor measure $M^\pi$ satisfies a Bellman-like equation, $M^\pi = P + \gamma P_\pi M^\pi$, as matrices in the finite case and as measures in the general case [Blier et al.]
• For any policy $\pi_z$, the Q-function of a reward $r$ can be written $Q_r^{\pi_z} = M^{\pi_z} r$ in matrix form.
  – This equals $F_z^T B\, \mathrm{diag}(\rho)\, r$; thus, setting $z_r := B\, \mathrm{diag}(\rho)\, r = \mathbb{E}_{s \sim \rho}[B(s)\, r(s)]$, the Q-function is obtained as $Q_r^{\pi_z} = F_z^T z_r$ for any $z \in \mathbb{R}^d$
FB can be learned by iteratively minimizing the Bellman residual of the parametric model $M = F^T B \rho$.
• Using a suitable norm $\|\cdot\|_\rho$ for the Bellman residual leads to a loss expressed as an expectation over the dataset:
• $\mathcal{L}(F, B) := \left\| F_z^T B \rho - \left(P + \gamma P_{\pi_z} F_z^T B \rho\right) \right\|_\rho^2$
  $= \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho,\, s' \sim \rho}\left[ \left( F(s_t, a_t, z)^T B(s') - \gamma\, F(s_{t+1}, \pi_z(s_{t+1}), z)^T B(s') \right)^2 \right] - 2\, \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\left[ F(s_t, a_t, z)^T B(s_{t+1}) \right] + \mathrm{const}$
  – where $z$ is sampled at random during training
The last term involves $B(s_{t+1})$ rather than $B(s_t)$ because the successor measure is defined over $s_{t+1}$ instead of $s_t$
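The FB loss above (dropping the constant term) can be sketched in PyTorch as follows; the target copies `F_tgt`/`B_tgt`, the policy head, and drawing $s' \sim \rho$ by shuffling the batch are my own assumptions, not the authors' exact implementation:

```python
import torch

def fb_loss(F, B, F_tgt, B_tgt, policy, batch, z, gamma=0.98):
    """FB Bellman-residual loss, up to an additive constant."""
    s, a, s_next = batch                                   # (s_t, a_t, s_{t+1}) ~ rho
    s_prime = s_next[torch.randperm(len(s_next))]          # independent states s' ~ rho (reused from the batch)
    m = F(s, a, z) @ B(s_prime).T                          # F(s_t, a_t, z)^T B(s'), shape (n, n)
    with torch.no_grad():                                  # bootstrap target without gradients
        a_next = policy(s_next, z)                         # pi_z(s_{t+1})
        m_next = F_tgt(s_next, a_next, z) @ B_tgt(s_prime).T
    residual = ((m - gamma * m_next) ** 2).mean()          # first (squared) term
    diag = (F(s, a, z) * B(s_next)).sum(-1).mean()         # E[ F(s_t, a_t, z)^T B(s_{t+1}) ]
    return residual - 2.0 * diag
```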
11. Algorithms for SF and FB Representations (III)
Learning basic features $\varphi$ for SF
Any representation learning method could be used to supply $\varphi$
• The authors suggest ten choices of basic features and describe the precise learning objective for each
1. Random features (Rand)
• Use a non-trainable, randomly initialized network as the features
2. Autoencoder (AEnc)
• Learn a decoder $f: \mathbb{R}^d \to S$ that recovers the state from its representation $\varphi$
  – $\min_{f, \varphi} \mathbb{E}_{s \sim \mathcal{D}}\left[ \| f(\varphi(s)) - s \|^2 \right]$
3. Inverse Curiosity Module (ICM)
• Aims at extracting the controllable aspects of the environment
• Train an inverse dynamics model $g: \mathbb{R}^d \times \mathbb{R}^d \to A$ to predict the action used for a transition between two consecutive states
  – $\min_{g, \varphi} \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\left[ \| g(\varphi(s_t), \varphi(s_{t+1})) - a_t \|^2 \right]$
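As an illustration of these feature-learning objectives, here is a minimal ICM-style sketch in PyTorch (layer sizes, the squared-error action loss, and the module layout are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ICMFeatures(nn.Module):
    """Minimal ICM-style feature learner: phi is trained so that an inverse model g
    can recover the action from (phi(s_t), phi(s_{t+1}))."""
    def __init__(self, state_dim, action_dim, feat_dim=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        self.g = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(), nn.Linear(128, action_dim))

    def loss(self, s, a, s_next):
        # Predict the action from the two consecutive state embeddings
        pred_a = self.g(torch.cat([self.phi(s), self.phi(s_next)], dim=-1))
        return ((pred_a - a) ** 2).mean()   # squared error for continuous (DMC-style) actions
```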
12. Algorithms for SF and FB Representations (IV)
Learning basic features $\varphi$ for SF
4. Transition model (Trans)
• Learn a one-step dynamics model $f: \mathbb{R}^d \times A \to S$ that predicts the next state from the current state representation
  – $\min_{f, \varphi} \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\left[ \| f(\varphi(s_t), a_t) - s_{t+1} \|^2 \right]$
5. Latent transition model (Latent)
• Learn a latent dynamics model that predicts the representation of the next state instead of the next state itself
  – $\min_{f, \varphi} \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\left[ \| f(\varphi(s_t), a_t) - \varphi(s_{t+1}) \|^2 \right]$
6. Laplacian eigenfunctions (Lap)
• Wu et al. consider the symmetrized MDP graph Laplacian induced by an exploratory policy $\pi$, defined as $\mathcal{L} = I - \frac{1}{2}\left( P_\pi\, \mathrm{diag}(\rho)^{-1} + \mathrm{diag}(\rho)^{-1} P_\pi^T \right)$
• They propose to learn the eigenfunctions of $\mathcal{L}$ via the spectral graph drawing objective:
  – $\min_{\varphi} \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}}\left[ \| \varphi(s_t) - \varphi(s_{t+1}) \|^2 \right] + \lambda\, \mathbb{E}_{s \sim \mathcal{D},\, s' \sim \mathcal{D}}\left[ \left(\varphi(s)^T \varphi(s')\right)^2 - \| \varphi(s) \|_2^2 - \| \varphi(s') \|_2^2 \right]$
  – where the second term is an orthonormality regularizer ensuring that $\mathbb{E}_{s \sim \rho}[\varphi(s)\, \varphi(s)^T] \approx I$
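A minimal sketch of the spectral graph drawing objective above, assuming `phi` is a feature network and the independent pairs $(s, s')$ are approximated by shuffling the batch (that shuffling and the single regularization weight are my assumptions):

```python
import torch

def laplacian_loss(phi, s_t, s_next, lam=1.0):
    """Spectral graph drawing objective for Laplacian eigenfunction features."""
    attract = ((phi(s_t) - phi(s_next)) ** 2).sum(-1).mean()      # pull consecutive states together
    s, s_prime = s_t, s_t[torch.randperm(len(s_t))]               # s ~ D, s' ~ D (approximately independent)
    f, f_prime = phi(s), phi(s_prime)
    ortho = ((f * f_prime).sum(-1) ** 2
             - (f ** 2).sum(-1)
             - (f_prime ** 2).sum(-1)).mean()                     # orthonormality regularizer
    return attract + lam * ortho
```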
13. Algorithms for SF and FB Representations (V)
Learning basic features $\varphi$ for SF
7. Low-rank approximation of P
• Learn the features by estimating a low-rank model of the transition probability densities: $P(ds' \mid s, a) \approx \mathcal{X}(s, a)^T \mu(s')\, \rho(ds')$.
• The corresponding loss on $\mathcal{X}^T \mu - P/\rho$ can be expressed as
  – $\min_{\mathcal{X}, \mu} \mathbb{E}_{(s_t, a_t) \sim \rho,\, s' \sim \rho}\left[ \left( \mathcal{X}(s_t, a_t)^T \mu(s') - \frac{P(ds' \mid s_t, a_t)}{\rho(ds')} \right)^2 \right] = \mathbb{E}_{(s_t, a_t) \sim \rho,\, s' \sim \rho}\left[ \left( \mathcal{X}(s_t, a_t)^T \mu(s') \right)^2 \right] - 2\, \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\left[ \mathcal{X}(s_t, a_t)^T \mu(s_{t+1}) \right] + C$
  – This loss is a special case of the FB loss, obtained by setting $\gamma = 0$ and omitting $z$
8. Contrastive learning (CL)
• Learn representations by pushing positive pairs closer together while keeping negative pairs apart
  – Here, two states are considered similar if they lie close to each other on the same trajectory
• They propose a SimCLR-like objective:
  – $\min_{\mathcal{X}, \varphi} -\mathbb{E}_{k \sim \mathrm{Geom}(1 - \gamma_{CL}),\, (s_t, s_{t+k}) \sim \mathcal{D}}\left[ \log \frac{\exp\left(\mathrm{cosine}(\mathcal{X}(s_t), \varphi(s_{t+k}))\right)}{\mathbb{E}_{s' \sim \mathcal{D}}\left[ \exp\left(\mathrm{cosine}(\mathcal{X}(s_t), \varphi(s'))\right) \right]} \right]$, where $\mathrm{cosine}(u, v) = \frac{u^T v}{\|u\|_2 \|v\|_2}$
9. Low-rank approximation of SR
• Learn the features by estimating a low-rank model of the successor measure of the exploration policy
  – $\min_{\mathcal{X}, \varphi} \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D},\, s' \sim \mathcal{D}}\left[ \left( \mathcal{X}(s_t)^T \varphi(s') - \gamma\, \bar{\mathcal{X}}(s_{t+1})^T \bar{\varphi}(s') \right)^2 \right] - 2\, \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}}\left[ \mathcal{X}(s_t)^T \varphi(s_{t+1}) \right]$, where $\bar{\mathcal{X}}$ and $\bar{\varphi}$ are target versions of $\mathcal{X}$ and $\varphi$
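A sketch of the SimCLR-like objective from item 8, assuming the data loader already samples `s_pos[i]` a geometric number of steps after `s_t[i]` on the same trajectory; using the other batch elements as negatives and a softmax cross-entropy over the batch is my approximation of the expectation in the denominator, not the authors' exact formulation:

```python
import torch
import torch.nn.functional as Fn

def contrastive_loss(X, phi, s_t, s_pos, temperature=1.0):
    """SimCLR-like trajectory-contrastive objective over a batch of (anchor, positive) state pairs."""
    anchors = Fn.normalize(X(s_t), dim=-1)              # X(s_t) / ||X(s_t)||_2
    positives = Fn.normalize(phi(s_pos), dim=-1)        # phi(s_{t+k}) / ||phi(s_{t+k})||_2
    logits = anchors @ positives.T / temperature        # pairwise cosine similarities, (n, n)
    labels = torch.arange(len(s_t))                     # the positive pair sits on the diagonal
    return Fn.cross_entropy(logits, labels)
```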
15. Environments
All methods were evaluated on the DeepMind Control Suite (ExORL benchmark)
Tasks and environments
• Point-mass Maze
  – State/action: 4-/2-dimensional vectors
• Walker: a planar walker
  – State/action: 24-/6-dimensional vectors
• Cheetah: a running planar biped
  – State/action: 17-/6-dimensional vectors
• Quadruped: a four-legged ant navigating in 3D space
  – State/action: 78-/12-dimensional vectors
Replay buffers
• RND: 10M training transitions collected with RND
• APS: 10M training transitions collected with APS
• Proto: 10M training transitions collected with ProtoRL
[Figures: Point-mass Maze / Walker and Cheetah / Quadruped environments]
16. Results – Zero-Shot Performance of the Proposed Methods
Comparison of 11 methods (FB and 10 SF-based models) with offline/online TD3
Performance of each method on each task in each environment, averaged over the three buffers and 10 seeds
Control group
• Online TD3: trained with the task reward and free environment interactions
• Offline TD3: trained with the task reward from the replay buffer
FB and Lap show superior performance compared with the other methods
• FB and Lap reach about 83% and 76% of supervised offline TD3 performance, respectively
[Figures: average scores over tasks for each environment; average zero-shot results over tasks, environments, buffers, and seeds]
In this section, we cover on-policy and off-policy algorithms as background.
On-policy algorithms are those in which the behavior policy pi_b that collects samples and the target policy pi being improved are the same or closely related. They offer stable policy learning, but relatively low sample efficiency.
In contrast, off-policy algorithms allow the behavior policy pi_b and the target policy pi being updated to be independent. Because they use a replay buffer, they achieve higher sample efficiency, at the cost of comparatively less stable learning.