1
[DL輪読会] A Survey of Recent Offline Reinforcement Learning
—Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems—
2020.06.26 Presenter: Tatsuya Matsushima @__tmats__ , Matsuo Lab
RL
• 

•
• …
• off-policy RL 

but distributional shift
•
2
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open
Problems
• Sergey Levine, Aviral Kumar, George Tucker, Justin Fu
• UC Berkeley / Google Brain
• https://arxiv.org/abs/2005.01643 (Submitted on 4 May 2020)
• Sergey Levine
• Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review
• https://arxiv.org/abs/1805.00909
• RL
•
•
• RLArch 3
1.
2. RL
3. RL
•
4. DP RL
5. RL
•
6.
7.
※ *
4
RL
5
MDP
•
•
•
•
•
•
• POMDP 

• Horizon
•
•
$\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, d_0, r, \gamma)$
$s \in \mathcal{S}$: state, $a \in \mathcal{A}$: action
$T(s_{t+1} \mid s_t, a_t)$: transition dynamics, $d_0(s_0)$: initial state distribution
$r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$: reward function, $\gamma \in (0,1]$: discount factor
$\pi(a_t \mid s_t)$: policy, $\tau = (s_0, a_0, \cdots, s_H, a_H)$: trajectory with horizon $H$

$$p_\pi(\tau) = d_0(s_0) \prod_{t=0}^{H} \pi(a_t \mid s_t)\, T(s_{t+1} \mid s_t, a_t)$$

$$J(\pi) = \mathbb{E}_{\tau \sim p_\pi(\tau)}\left[ \sum_{t=0}^{H} \gamma^t\, r(s_t, a_t) \right]$$

$d_t^\pi(s_t)$: state marginal at time $t$, $d^\pi(s)$: overall state visitation distribution
6
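To make the objective concrete, a minimal sketch of the inner discounted sum (plain NumPy; the function name is illustrative and not from the paper):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Inner sum of J(pi): sum_t gamma^t * r_t for a single trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# J(pi) is estimated by averaging this over trajectories sampled from p_pi(tau).
print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))  # 1.0 + 0.81 = 1.81
```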
REINFORCE
•
Policy $\pi_\theta(a_t \mid s_t)$:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim p_{\pi_\theta}(\tau)}\left[ \sum_{t=0}^{H} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( \sum_{t'=t}^{H} \gamma^{t'-t}\, r(s_{t'}, a_{t'}) - b(s_t) \right) \right]$$

$$= \sum_{t=0}^{H} \mathbb{E}_{s_t \sim d_t^\pi(s_t),\, a_t \sim \pi_\theta(a_t \mid s_t)}\left[ \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}(s_t, a_t) \right]$$

$\hat{A}(s_t, a_t)$: advantage estimate
7
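A minimal sketch of this Monte-Carlo policy gradient for a tabular softmax policy, assuming trajectories are given as (s, a, r) tuples (NumPy; all names are illustrative):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def reinforce_gradient(theta, trajectory, gamma=0.99, baseline=None):
    """Single-trajectory REINFORCE estimate of grad_theta J for a tabular softmax policy.

    theta: (n_states, n_actions) logits, pi_theta(a|s) = softmax(theta[s]).
    trajectory: list of (s, a, r) tuples collected by pi_theta.
    baseline: optional array b[s] subtracted from the reward-to-go.
    """
    grad = np.zeros_like(theta)
    rewards = np.array([r for (_, _, r) in trajectory], dtype=float)
    H = len(trajectory)
    for t, (s, a, _) in enumerate(trajectory):
        reward_to_go = np.sum(gamma ** np.arange(H - t) * rewards[t:])
        b = 0.0 if baseline is None else baseline[s]
        score = -softmax(theta[s])      # d log pi(a|s) / d theta[s] = onehot(a) - pi(.|s)
        score[a] += 1.0
        grad[s] += (gamma ** t) * score * (reward_to_go - b)
    return grad
```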
DP Q-learning
• Q
Q 

•
Q-learning
•
•
$V^\pi$ (state value) and $Q^\pi$ (action value):

$$V^\pi(s_t) = \mathbb{E}_{\tau \sim p_\pi(\tau \mid s_t)}\left[ \sum_{t'=t}^{H} \gamma^{t'-t}\, r(s_{t'}, a_{t'}) \right], \qquad Q^\pi(s_t, a_t) = \mathbb{E}_{\tau \sim p_\pi(\tau \mid s_t, a_t)}\left[ \sum_{t'=t}^{H} \gamma^{t'-t}\, r(s_{t'}, a_{t'}) \right]$$

Bellman equation and Bellman operator $\mathcal{B}^\pi$ (fixed point $\vec{Q}^\pi = \mathcal{B}^\pi \vec{Q}^\pi$):

$$Q^\pi(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim T(s_{t+1} \mid s_t, a_t),\, a_{t+1} \sim \pi(a_{t+1} \mid s_{t+1})}\left[ Q^\pi(s_{t+1}, a_{t+1}) \right]$$

Bellman optimality equation and optimality operator $\mathcal{B}^*$ (fixed point $\vec{Q}^\star = \mathcal{B}^* \vec{Q}^\star$):

$$Q^\star(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim T(s_{t+1} \mid s_t, a_t)}\left[ \max_{a_{t+1}} Q^\star(s_{t+1}, a_{t+1}) \right]$$
8
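A tabular sketch of the two sample-based backups above, Q-learning's optimality backup and the policy-evaluation backup used for $Q^\pi$ (NumPy; names are illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Sample-based Bellman optimality backup (B*), as used by Q-learning."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def policy_evaluation_target(Q, r, s_next, a_next, gamma=0.99):
    """Sample-based policy-evaluation backup (B^pi), with a_next drawn from pi."""
    return r + gamma * Q[s_next, a_next]
```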
Actor-Critic
• 

Q
• 2
• Q
• Q 

Actor $\pi_\theta(a_t \mid s_t)$; critic trained on the policy-evaluation backup:

$$Q^\pi(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim T(s_{t+1} \mid s_t, a_t),\, a_{t+1} \sim \pi_\theta(a_{t+1} \mid s_{t+1})}\left[ Q^\pi(s_{t+1}, a_{t+1}) \right]$$
9
On-policy vs Off-policy
On-policy
•
• SARSA
Off-policy
• 

• Q-learning experience replay
10
2. RL
11
RL
• ,
•
$$\mathcal{D} = \{(s_t^i, a_t^i, s_{t+1}^i, r_t^i)\}, \qquad (s, a) \in \mathcal{D}:\ s \sim d^{\pi_\beta}(s),\ a \sim \pi_\beta(a \mid s)$$

$\pi_\beta$: behavior policy that collected the data
12
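A minimal sketch of what the static dataset looks like in code, a fixed set of logged transitions that is sampled but never extended (NumPy; field names are illustrative, not a particular library's API):

```python
import numpy as np
from typing import NamedTuple

class OfflineDataset(NamedTuple):
    """Static transitions logged by pi_beta; fixed before training, never extended."""
    states: np.ndarray       # (N, state_dim)
    actions: np.ndarray      # (N,) or (N, action_dim)
    rewards: np.ndarray      # (N,)
    next_states: np.ndarray  # (N, state_dim)
    terminals: np.ndarray    # (N,), bool

def sample_batch(data: OfflineDataset, batch_size: int, rng: np.random.RandomState):
    idx = rng.randint(0, len(data.states), size=batch_size)
    return OfflineDataset(*(field[idx] for field in data))
```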
[Lange+12]
RL
• batch RL RL
• RL DQN
• RL
• pure-batch RL 

• fully off-policy
• fully
• RL
•
13
RL
• exploration
• e bot
• exploration 

• 

• QT-Opt[Kalashnikov+18]
14
• 

•
• but open problem
• counterfactual
•
• distributional shift
• 

𝒟
𝒟
15
3. RL
16
off-policy off-policy policy evaluation, OPE
• [Thomas+15]
•
•
• unbiased
•
self-normalize biased
Estimate $J(\pi_\theta)$ for a target policy $\pi_\theta$ from trajectories collected by $\pi_\beta$:

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\beta(\tau)}\left[ \frac{\pi_\theta(\tau)}{\pi_\beta(\tau)} \sum_{t=0}^{H} \gamma^t\, r(s_t, a_t) \right] = \mathbb{E}_{\tau \sim \pi_\beta(\tau)}\left[ \left( \prod_{t=0}^{H} \frac{\pi_\theta(a_t \mid s_t)}{\pi_\beta(a_t \mid s_t)} \right) \sum_{t=0}^{H} \gamma^t\, r(s_t, a_t) \right] \approx \sum_{i=1}^{n} w_H^i \sum_{t=0}^{H} \gamma^t\, r_t^i$$

$$w_t^i = \frac{1}{n} \prod_{t'=0}^{t} \frac{\pi_\theta(a_{t'}^i \mid s_{t'}^i)}{\pi_\beta(a_{t'}^i \mid s_{t'}^i)}$$

using $n$ trajectories $\{(s_t^i, a_t^i, r_t^i, s_{t+1}^i, \cdots)\}_{i=1}^{n}$ sampled from $\pi_\beta$; the self-normalized (weighted) variant divides by $\sum_{i=1}^{n} w_H^i$.
17
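A minimal sketch of the trajectory-wise estimator above, with the self-normalized (weighted) variant as an option (NumPy; log_pi_theta and log_pi_beta are assumed to be callables returning log-probabilities):

```python
import numpy as np

def ope_importance_sampling(trajectories, log_pi_theta, log_pi_beta, gamma=0.99,
                            self_normalize=False):
    """Trajectory-wise IS estimate of J(pi_theta) from data collected by pi_beta.

    trajectories: list of trajectories, each a list of (s, a, r) tuples.
    log_pi_theta, log_pi_beta: callables (s, a) -> log-probability of a in s.
    self_normalize=True gives the weighted (self-normalized) estimator:
    lower variance, but biased.
    """
    weights, returns = [], []
    for traj in trajectories:
        log_w = sum(log_pi_theta(s, a) - log_pi_beta(s, a) for (s, a, _) in traj)
        weights.append(np.exp(log_w))
        returns.append(sum((gamma ** t) * r for t, (_, _, r) in enumerate(traj)))
    weights, returns = np.asarray(weights), np.asarray(returns)
    if self_normalize:
        return float(np.sum(weights * returns) / np.sum(weights))
    return float(np.mean(weights * returns))
```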
off-policy off-policy policy evaluation, OPE
•
•
•
• self-normalization
• Doubly robust estimator
•
• Q
• or unbiased
Per-decision importance sampling (the reward at time $t$ does not need importance weights from later steps $t' > t$):

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\beta(\tau)}\left[ \sum_{t=0}^{H} \left( \prod_{t'=0}^{t} \frac{\pi_\theta(a_{t'} \mid s_{t'})}{\pi_\beta(a_{t'} \mid s_{t'})} \right) \gamma^t\, r(s_t, a_t) \right] \approx \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{H} w_t^i\, \gamma^t\, r_t^i$$

Doubly robust estimator, using an approximate value function $\hat{Q}^{\pi_\theta}(s_t, a_t)$ as a control variate:

$$J(\pi_\theta) \approx \sum_{i=1}^{n} \sum_{t=0}^{H} \gamma^t \left( w_t^i \left( r_t^i - \hat{Q}^{\pi_\theta}(s_t^i, a_t^i) \right) + w_{t-1}^i\, \mathbb{E}_{a \sim \pi_\theta(a \mid s_t^i)}\left[ \hat{Q}^{\pi_\theta}(s_t^i, a) \right] \right)$$

The estimate remains unbiased if either the importance weights (i.e. $\pi_\beta$) or $\hat{Q}$ is correct, hence "doubly robust".
18
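A sketch of the per-decision doubly robust estimator above, assuming approximate q_hat(s, a) and v_hat(s) = E_{a~pi_theta}[q_hat(s, a)] are available (NumPy; names are illustrative):

```python
import numpy as np

def ope_doubly_robust(trajectories, log_pi_theta, log_pi_beta, q_hat, v_hat, gamma=0.99):
    """Per-decision doubly robust OPE estimate.

    q_hat(s, a): approximate Q^{pi_theta};  v_hat(s): E_{a ~ pi_theta}[q_hat(s, a)].
    Implements  (1/n) sum_i sum_t gamma^t ( w_t (r_t - Q(s_t, a_t)) + w_{t-1} V(s_t) ).
    """
    estimates = []
    for traj in trajectories:
        w, total = 1.0, 0.0                      # w_{-1} = 1
        for t, (s, a, r) in enumerate(traj):
            w_prev = w
            w *= np.exp(log_pi_theta(s, a) - log_pi_beta(s, a))   # per-decision weight w_t
            total += (gamma ** t) * (w * (r - q_hat(s, a)) + w_prev * v_hat(s))
        estimates.append(total)
    return float(np.mean(estimates))
```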
off-policy off-policy policy gradient
•
•
• self-normalization
Estimate $\nabla_\theta J(\pi_\theta)$ for a target policy $\pi_\theta$ from trajectories collected by $\pi_\beta$:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\beta(\tau)}\left[ \frac{\pi_\theta(\tau)}{\pi_\beta(\tau)} \sum_{t=0}^{H} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}(s_t, a_t) \right]$$

$$= \mathbb{E}_{\tau \sim \pi_\beta(\tau)}\left[ \left( \prod_{t=0}^{H} \frac{\pi_\theta(a_t \mid s_t)}{\pi_\beta(a_t \mid s_t)} \right) \sum_{t=0}^{H} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}(s_t, a_t) \right] \approx \sum_{i=1}^{n} w_H^i \sum_{t=0}^{H} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, \hat{A}(s_t^i, a_t^i)$$

using $n$ trajectories $\{(s_t^i, a_t^i, r_t^i, s_{t+1}^i, \cdots)\}_{i=1}^{n}$ sampled from $\pi_\beta$.
19
off-policy off-policy policy gradient
•
•
•
• Doubly robust estimator
• off-policy
• Softmax
•
•
Per-decision importance-sampled policy gradient:

$$\nabla_\theta J(\pi_\theta) \approx \sum_{i=1}^{n} \sum_{t=0}^{H} w_t^i\, \gamma^t \left( \sum_{t'=t}^{H} \gamma^{t'-t}\, \frac{w_{t'}^i}{w_t^i}\, r_{t'}^i - b(s_t^i) \right) \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)$$

Self-normalized variant with a regularizer on the sum of importance weights:

$$\nabla_\theta \bar{J}(\pi_\theta) \approx \left( \sum_{i=1}^{n} w_H^i \sum_{t=0}^{H} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, \hat{A}(s_t^i, a_t^i) \right) + \lambda \log \left( \sum_{i=1}^{n} w_H^i \right)$$
20
off-policy
•
• biased sub-optimal [Imani+18] 

[Fu+19]
• 

• off-policy
•
• off-policy
• DRL: DPG [Silver+14], IPG [Gu+17]
Ignore the state-distribution mismatch between $d^\pi(s)$ and $d^{\pi_\beta}(s)$ and optimize the approximate objective $J_\beta(\pi_\theta) = \mathbb{E}_{s \sim d^{\pi_\beta}(s)}\left[ V^\pi(s) \right]$ instead of $J(\pi_\theta)$, sampling $d^{\pi_\beta}(s)$ from $\mathcal{D}$:

$$\nabla_\theta J_\beta(\pi_\theta) = \mathbb{E}_{s \sim d^{\pi_\beta}(s),\, a \sim \pi_\theta(a \mid s)}\left[ Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) + \nabla_\theta Q^{\pi_\theta}(s, a) \right] \approx \mathbb{E}_{s \sim d^{\pi_\beta}(s),\, a \sim \pi_\theta(a \mid s)}\left[ Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]$$

(the $\nabla_\theta Q^{\pi_\theta}(s, a)$ term is dropped)
21
Marginalized Importance Sampling
• bias
• Forward Bellman Equation
•
• Backward Bellman Equation
•
• DualDICE[Nachum+19a] AlgaeDICE[Nachum+19b] [Nachum+20]
• RLArch
$$\rho^{\pi_\theta}(s) = \frac{d^{\pi_\theta}(s)}{d^{\pi_\beta}(s)}$$

Estimate the state-marginal importance ratio $\rho^{\pi_\theta}(s)$ directly.
22
Marginalized Importance Sampling
• Forward Bellman Equation
•
•


• TD [Gelada+19]
•
•
Forward Bellman equation for $\rho^{\pi}$ (for all $s'$):

$$\underbrace{d^{\pi_\beta}(s')\, \rho^{\pi}(s')}_{(d^{\pi_\beta} \circ \rho^{\pi})(s')} = \underbrace{(1-\gamma)\, d_0(s') + \gamma \sum_{s,a} d^{\pi_\beta}(s)\, \rho^{\pi}(s)\, \pi(a \mid s)\, T(s' \mid s, a)}_{(\mathcal{B}^{\pi} \circ \rho^{\pi})(s')}$$

TD-style update [Gelada+19] on transitions $s \sim d^{\pi_\beta}$, $a \sim \pi_\beta(a \mid s)$, $s' \sim T(s' \mid s, a)$:

$$\hat{\rho}^{\pi}(s') \leftarrow \hat{\rho}^{\pi}(s') + \alpha \left[ (1-\gamma) + \gamma\, \frac{\pi(a \mid s)}{\pi_\beta(a \mid s)}\, \hat{\rho}^{\pi}(s) - \hat{\rho}^{\pi}(s') \right]$$
23
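A tabular sketch of the TD-style ratio update above, in the spirit of [Gelada+19] (NumPy; pi and pi_beta are arrays of action probabilities, names are illustrative):

```python
import numpy as np

def update_density_ratio(rho, s, a, s_next, pi, pi_beta, alpha=0.01, gamma=0.99):
    """One TD-style update of rho(s) ~= d^pi(s) / d^{pi_beta}(s), tabular case.

    (s, a, s_next): a transition from the offline data (s ~ d^{pi_beta}, a ~ pi_beta).
    pi, pi_beta: (n_states, n_actions) arrays of action probabilities.
    """
    target = (1.0 - gamma) + gamma * (pi[s, a] / pi_beta[s, a]) * rho[s]
    rho[s_next] += alpha * (target - rho[s_next])
    return rho
```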
Marginalized Importance Sampling
• Forward Bellman Equation
• [Liu+18]
•
• [Mousavi+20, Kallus+19, Tang+19, Uehara+19]
• GenDICE[Zhang+20]
• 1 

f-divergence
[Liu+18]: minimax objective with a test function $f$,

$$\min_{\rho} \max_{f}\ L(\rho, f)^2, \qquad L(\rho, f) = \gamma\, \mathbb{E}_{s,a,s' \sim \mathcal{D}}\left[ \left( \rho(s)\, \frac{\pi(a \mid s)}{\pi_\beta(a \mid s)} - \rho(s') \right) f(s') \right] + (1-\gamma)\, \mathbb{E}_{s_0 \sim d_0}\left[ (1 - \rho(s_0))\, f(s_0) \right]$$

GenDICE [Zhang+20]: minimize an $f$-divergence between both sides of the forward Bellman equation, with a normalization constraint:

$$\min_{\rho^{\pi}}\ D_f\!\left( (\mathcal{B}^{\pi} \circ \rho^{\pi})(s, a),\ (d^{\pi_\beta} \circ \rho^{\pi})(s, a) \right) \quad \text{s.t.} \quad \mathbb{E}_{s,a,s' \sim \mathcal{D}}\left[ \rho^{\pi}(s, a) \right] = 1$$

$\pi_\beta$: behavior policy that collected $\mathcal{D}$
24
RL off-policy
•
RL 

•
• 

$\pi_\beta$: behavior policy, $\pi_\theta$: target policy
25
off-policy 

DP state-marginal
• RL OoD
• DP
•
26
4.DP RL
27
DP
•
•
•
•
• Q-learning Q
• RL Q
• [Agarwal+19]
$d^{\pi_\beta}(s)$ vs. $d^{\pi}(s)$; $\pi(a \mid s)$, $\arg\max_a Q(s, a)$
28
• policy constraint
•
• uncertainty-based
• Q epistemic uncertainty
π πβ
29


Q OoD
• actor
• Q
• explicit f-
• implicit f-
• integral probability metric (IPM)
π(a′|s′) πβ(a′|s′)
30
• actor
•
• Q
•
Constrained policy iteration: policy evaluation on $\mathcal{D}$,

$$\hat{Q}^{\pi}_{k+1} \leftarrow \arg\min_{Q}\ \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\left[ \left( Q(s,a) - \left( r(s,a) + \gamma\, \mathbb{E}_{a' \sim \pi_k(a' \mid s')}\left[ \hat{Q}^{\pi}_{k}(s', a') \right] \right) \right)^2 \right]$$

and constrained policy improvement,

$$\pi_{k+1} \leftarrow \arg\max_{\pi}\ \mathbb{E}_{s \sim \mathcal{D}}\left[ \mathbb{E}_{a \sim \pi(a \mid s)}\left[ \hat{Q}^{\pi}_{k+1}(s, a) \right] \right] \quad \text{s.t.} \quad D(\pi, \pi_\beta) \le \epsilon$$

or, with the divergence as a penalty in both steps,

$$\hat{Q}^{\pi}_{k+1} \leftarrow \arg\min_{Q}\ \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\left[ \left( Q(s,a) - \left( r(s,a) + \gamma\, \mathbb{E}_{a' \sim \pi_k(a' \mid s')}\left[ \hat{Q}^{\pi}_{k}(s', a') \right] - \alpha\gamma\, D\big(\pi_k(\cdot \mid s'), \pi_\beta(\cdot \mid s')\big) \right) \right)^2 \right]$$

$$\pi_{k+1} \leftarrow \arg\max_{\pi}\ \mathbb{E}_{s \sim \mathcal{D}}\left[ \mathbb{E}_{a \sim \pi(a \mid s)}\left[ \hat{Q}^{\pi}_{k+1}(s, a) - \alpha\, D\big(\pi(\cdot \mid s), \pi_\beta(\cdot \mid s)\big) \right] \right]$$
31
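A minimal sketch of the penalized critic target above (the value-penalty form used by e.g. BRAC [Wu+19]); NumPy, names are illustrative, and the divergence estimate is assumed to be computed elsewhere:

```python
import numpy as np

def penalized_critic_target(r, q_next, div_next, alpha=0.1, gamma=0.99):
    """Behavior-regularized critic target: r + gamma * (E[Q_k(s', a')] - alpha * D).

    r: rewards (batch,).
    q_next: estimates of E_{a'~pi_k}[Q_k(s', a')] (batch,).
    div_next: estimates of D(pi_k(.|s'), pi_beta(.|s')) (batch,), e.g. a sampled KL.
    """
    return r + gamma * (q_next - alpha * div_next)
```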
• explicit f-
• KL WOP[Jaques+19] BRAC[Wu+19]
• implicit f-
• KL AWR[Peng+19] ABM[Siegel+20]
•
•
• integral probability metric (IPM)
• MMD BEAR [Kumar+19] Wasserstein BRAC[Wu+19]
Implicit (closed-form) KL-constrained update:

$$\bar{\pi}_{k+1}(a \mid s) \leftarrow \frac{1}{Z}\, \pi_\beta(a \mid s)\, \exp\!\left( \frac{1}{\alpha} Q^{\pi}_{k}(s, a) \right), \qquad \pi_{k+1} \leftarrow \arg\min_{\pi}\ D_{\mathrm{KL}}\left( \bar{\pi}_{k+1}, \pi \right)$$

i.e. reweight the behavior policy by $\exp\!\left( \frac{1}{\alpha} Q^{\pi}_{k}(s, a) \right)$ and project it back onto the policy class.
32
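In the tabular case the reweight-and-project step above has a closed form; a minimal sketch (NumPy, names are illustrative). In practice (e.g. AWR, ABM) the projection is realized as weighted supervised regression on dataset actions rather than computed exactly:

```python
import numpy as np

def kl_regularized_policy(pi_beta, Q, alpha=1.0):
    """Closed-form maximizer of E_pi[Q(s, .)] - alpha * KL(pi || pi_beta), per state.

    pi_beta: (n_states, n_actions) behavior policy, Q: (n_states, n_actions) critic.
    Returns pi(a|s) proportional to pi_beta(a|s) * exp(Q(s, a) / alpha).
    """
    logits = np.log(pi_beta + 1e-12) + Q / alpha
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)
```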
• RL KL(f-divergence)
• uniform RL 

KL exploit 

• 



• 

[Kumar+19]
33
Q epistemic uncertainty
•
• Q
Q
• Q BEAR [Kumar+19]
• Q [Agarwal+19]
$$\pi_{k+1} \leftarrow \arg\max_{\pi}\ \mathbb{E}_{s \sim \mathcal{D}}\left[ \mathbb{E}_{a \sim \pi(a \mid s)}\left[ \mathbb{E}_{Q^{\pi}_{k+1} \sim \mathcal{P}_{\mathcal{D}}(Q^{\pi})}\left[ Q^{\pi}_{k+1}(s, a) \right] - \alpha\, \mathrm{Unc}\left( \mathcal{P}_{\mathcal{D}}(Q^{\pi}) \right) \right] \right]$$

$\mathcal{P}_{\mathcal{D}}(Q^{\pi})$: distribution over Q-functions learned from $\mathcal{D}$; $\mathrm{Unc}$: uncertainty metric, subtracted to give a conservative estimate
34
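A minimal sketch of such a conservative estimate, with a bootstrapped ensemble standing in for $\mathcal{P}_{\mathcal{D}}(Q^{\pi})$ and the ensemble standard deviation as one possible choice of Unc (NumPy; names are illustrative):

```python
import numpy as np

def conservative_q(q_ensemble, alpha=1.0):
    """Conservative value estimate from an ensemble standing in for P_D(Q^pi).

    q_ensemble: (n_models, batch, n_actions) Q-values from bootstrapped critics.
    Uses mean - alpha * std, i.e. the ensemble std as one possible Unc(.).
    """
    return q_ensemble.mean(axis=0) - alpha * q_ensemble.std(axis=0)
```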
calibration
•
• [Fujimoto+18a]
• Q
•
•
• open problem
𝒫 𝒟 Unc
𝒟 πβ
𝒟
35
5. RL
36
•
• MPC [Tassa+12]
• BPTT PILCO[Deisenroth+11]
• Dyna[Sutton+91] PIPPS[Parmas+19]
•
RL
•
• distribution shift
• RL surprisingly limited
•
$T_\psi(s_{t+1} \mid s_t, a_t)$: learned dynamics model
37
RL


NeurIPS2020
MOReL [Kidambi+20*]
•
• Pessimistic MDP
• 

MOPO [Yu+20*]
•
•
BREMEN [Matsushima+20*]
• 

+ TRPO
• RL 5-10% 

38
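As one concrete instance of the ideas above, MOPO [Yu+20*] penalizes the model reward with a model-uncertainty estimate during rollouts; a hedged sketch using ensemble disagreement as the uncertainty proxy (NumPy; names and the exact estimator are illustrative):

```python
import numpy as np

def ensemble_disagreement(next_state_preds):
    """Model-uncertainty proxy u(s, a): spread of ensemble next-state predictions.

    next_state_preds: (n_models, batch, state_dim) predictions of the learned dynamics.
    """
    return np.linalg.norm(next_state_preds.std(axis=0), axis=-1)

def penalized_reward(r_model, uncertainty, lam=1.0):
    """MOPO-style penalized reward for model rollouts: r_tilde = r_hat - lam * u."""
    return r_model - lam * uncertainty
```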
long-horizon MDP open problem
• RL
• DFP[Dosovitskiy+17], VPN[Oh+17], BADGR[Kahn+20]
• DP RL
• fitted value iteration
[Vanseijen+15,Parr+08]
39
7.
40
RL counterfactual
•
• i.i.d
• [Schölkopf19] [Gal+16] [Kingma+14] 

distributional robustness[Sinha+17] invariance[Arjovsky+19]
• 

visual foresight[Finn+17,Ebert+18] InteractionNet[Battaglia+16]
RL RL
41
42
RL
•
• D4RL[Kumar+20] RL Unplugged [Gulcehre+20*]
•
• IRIS[Mandlekar+19*]
•
• off-policy RL 

• Control as Inference RL
•
• BC AWR[Peng+19]
• GAIL[Ho+16*]
• T-REX[Brown+19a*] D-REX[Brown+19b*] 43
[Agarwal+19] Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi. “An Optimistic Perspective on Offline Reinforcement Learning” https://arxiv.org/abs/
1907.04543
[Arjovsky+19] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, David Lopez-Paz. "Invariant Risk Minimization”. https://arxiv.org/abs/1907.02893
[Battaglia+16] Peter W. Battaglia, Razvan Pascanu, Matthew Lai, Danilo Rezende, Koray Kavukcuoglu. “Interaction Networks for Learning about Objects, Relations
and Physics”. https://arxiv.org/abs/1612.00222
[Brown+19a*] Daniel S. Brown, Wonjoon Goo, Prabhat Nagarajan, Scott Niekum. “Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement
Learning from Observations”. https://arxiv.org/abs/1904.06387
[Brown+19b*] Daniel S. Brown, Wonjoon Goo, Scott Niekum. “Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations”. https://arxiv.org/
abs/1907.03976
[Deisenroth+11] Marc Deisenroth, Carl E. Rasmussen. "PILCO: A model-based and data-efficient approach to policy search." Proceedings of the 28th International
Conference on machine learning. 2011.
[Dosovitskiy+17] Alexey Dosovitskiy, Vladlen Koltun. “Learning to Act by Predicting the Future”. https://arxiv.org/abs/1611.01779
[Ebert+18] Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, Sergey Levine. “Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-
Based Robotic Control”. https://arxiv.org/abs/1812.00568
[Finn+17] Chelsea Finn, Sergey Levine. “Deep Visual Foresight for Planning Robot Motion”. https://arxiv.org/abs/1610.00696
[Fu+19] Justin Fu, Aviral Kumar, Matthew Soh, Sergey Levine. “Diagnosing bottlenecks in deep Q-learning algorithms”. https://arxiv.org/abs/1902.10250
[Fujimoto+18a] Scott Fujimoto, David Meger, Doina Precup. “Off-Policy Deep Reinforcement Learning without Exploration”. https://arxiv.org/abs/1812.02900
[Gal+16] Yarin Gal, Zoubin Ghahramani. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”. https://arxiv.org/abs/1506.02142
[Gelada+19] Carles Gelada, Marc G. Bellemare. “Off-policy deep reinforcement learning by bootstrapping the covariate shift”. https://arxiv.org/abs/1901.09455
[Gu+17] Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard Schölkopf, Sergey Levine. “Interpolated policy gradient: Merging on-policy
and off-policy gradient estimation for deep reinforcement learning”. https://arxiv.org/abs/1706.00387
[Gulcehre+20*] Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Tom Le Paine, Sergio Gómez Colmenarejo, Konrad Zolna, Rishabh Agarwal, Josh Merel, Daniel
Mankowitz, Cosmin Paduraru, Gabriel Dulac-Arnold, Jerry Li, Mohammad Norouzi, Matt Hoffman, Ofir Nachum, George Tucker, Nicolas Heess, Nando de Freitas. “RL
Unplugged: Benchmarks for Offline Reinforcement Learning”. https://arxiv.org/abs/2006.13888
[Ho+16*] Jonathan Ho, Stefano Ermon. “Generative Adversarial Imitation Learning”. https://arxiv.org/abs/1606.03476
[Imani+18] Ehsan Imani, Eric Graves, Martha White. “An Off-policy Policy Gradient Theorem Using Emphatic Weightings”. https://arxiv.org/abs/1811.09013 44
[Jaques+19] Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, Rosalind Picard. “Way Off-Policy
Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog”. https://arxiv.org/abs/1907.00456
[Kahn+20] Gregory Kahn, Pieter Abbeel, Sergey Levine. “BADGR: An Autonomous Self-Supervised Learning-Based Navigation System”. https://arxiv.org/abs/2002.05700
[Kalashnikov+18] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent
Vanhoucke, Sergey Levine. “QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation”. https://arxiv.org/abs/1806.10293
[Kallus+19] Nathan Kallus, Masatoshi Uehara. “Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning”. https://arxiv.org/
abs/1909.05850
[Kidambi+20*] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, Thorsten Joachims. “MOReL : Model-Based Offline Reinforcement Learning”. https://arxiv.org/
abs/2005.05951
[Kingma+14] Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling. “Semi-Supervised Learning with Deep Generative Models” https://arxiv.org/abs/
1406.5298
[Kumar+19] Aviral Kumar, Justin Fu, George Tucker, Sergey Levine. “Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction”. https://arxiv.org/abs/1906.00949
[Kumar+20] Does on-policy data collection fix errors in reinforcement learning? https://bair.berkeley.edu/blog/2020/03/16/discor/.
[Lange+12] Sascha Lange, Thomas Gabel, Martin Riedmiller. “Batch reinforcement learning”. Reinforcement learning. Springer, 2012. 45-73.
[Liu+18] Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou. “Breaking the curse of horizon: Infinite-horizon off-policy estimation”. https://arxiv.org/abs/1810.12429
[Mandlekar+19*] Ajay Mandlekar, Fabio Ramos, Byron Boots, Silvio Savarese, Li Fei-Fei, Animesh Garg, Dieter Fox. “IRIS: Implicit Reinforcement without Interaction at
Scale for Learning Control from Offline Robot Manipulation Data”. https://arxiv.org/abs/1911.05321
[Matsushima+20*] Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, Shixiang Gu. “Deployment-Efficient Reinforcement Learning via Model-Based Offline
Optimization”. https://arxiv.org/abs/2006.03647
[Mousavi+20] Ali Mousavi, Lihong Li, Qiang Liu, Denny Zhou. “Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning”. https://arxiv.org/abs/
2003.11126
[Nachum+19a] Ofir Nachum, Yinlam Chow, Bo Dai, Lihong Li. “DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections”. https://
arxiv.org/abs/1906.04733
[Nachum+19b] Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, Dale Schuurmans. “AlgaeDICE: Policy Gradient from Arbitrary Experience”. https://arxiv.org/
abs/1912.02074
[Nachum+20] Ofir Nachum, Bo Dai. “Reinforcement Learning via Fenchel-Rockafellar Duality”. https://arxiv.org/abs/2001.01866
45
[Oh+17] Junhyuk Oh, Satinder Singh, Honglak Lee. “Value Prediction Network”. https://arxiv.org/abs/1707.03497
[Parr+08] Parr, Ronald, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, Michael L. Littman. “An analysis of linear models, linear value-function approximation, and
feature selection for reinforcement learning”. In Proceedings of the 25th international conference on Machine learning. 2008.
[Parmas+19] Paavo Parmas, Carl Edward Rasmussen, Jan Peters, Kenji Doya. “PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos”. https://
arxiv.org/abs/1902.01240
[Peng+19] Xue Bin Peng, Aviral Kumar, Grace Zhang, Sergey Levine. "Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning”. https://
arxiv.org/abs/1910.00177
[Schölkopf19] Bernhard Schölkopf. “Causality for machine learning”. https://arxiv.org/abs/1911.10500
[Siegel+20] Noah Y. Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, Martin
Riedmiller. “Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning”. https://arxiv.org/abs/2002.08396
[Sinha+17] Aman Sinha, Hongseok Namkoong, Riccardo Volpi, John Duchi. “Certifying Some Distributional Robustness with Principled Adversarial Training”. https://
arxiv.org/abs/1710.10571
[Silver+14] David Silver. Guy Lever. Nicolas Heess. Thomas Degris. Daan Wierstra. Martin Riedmiller. ”Deterministic policy gradient algorithms”. Proceedings of the 31st
International Conference on International Conference on Machine Learning. 2014.
[Sutton+91] Dyna, an integrated architecture for learning, planning, and reacting”. ACM Sigart Bulletin, 2(4), 160-163. 1991.
[Tassa+12] Yuval Tassa, Tom Erez, Emanuel Todorov. "Synthesis and stabilization of complex behaviors through online trajectory optimization." 2012 IEEE/RSJ
International Conference on Intelligent Robots and Systems. 2012.
[Tang+19] Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, Qiang Liu. “Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation”. https://arxiv.org/abs/
1910.07186
[Thomas+15] Philip S. Thomas, Georgios Theocharous, Mohammad Ghavamzadeh. "High-confidence off-policy evaluation”. Twenty-Ninth AAAI Conference on Artificial
Intelligence. 2015.
[Uehara+19] Masatoshi Uehara, Jiawei Huang, Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation”, https://arxiv.org/abs/1910.12809
[Vanseijen+15] Harm Vanseijen, Rich Sutton. "A deeper look at planning as learning from replay." In International conference on machine learning. 2015.
[Wu+19] Yifan Wu, George Tucker, Ofir Nachum. “Behavior Regularized Offline Reinforcement Learning”. https://arxiv.org/abs/1911.11361
[Yu+20*] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, Tengyu Ma. “MOPO: Model-based Offline Policy Optimization”.
https://arxiv.org/abs/2005.13239
[Zhang+20] Ruiyi Zhang, Bo Dai, Lihong Li, Dale Schuurmans. “GenDICE: Generalized Offline Estimation of Stationary Values”. https://arxiv.org/abs/2002.09072 46
47

More Related Content

What's hot

[DL輪読会]Learning Latent Dynamics for Planning from Pixels
[DL輪読会]Learning Latent Dynamics for Planning from Pixels[DL輪読会]Learning Latent Dynamics for Planning from Pixels
[DL輪読会]Learning Latent Dynamics for Planning from Pixels
Deep Learning JP
 
[DL輪読会]Control as Inferenceと発展
[DL輪読会]Control as Inferenceと発展[DL輪読会]Control as Inferenceと発展
[DL輪読会]Control as Inferenceと発展
Deep Learning JP
 
[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...
[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...
[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...
Deep Learning JP
 
強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)
強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)
強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)
Shota Imai
 
深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)
Masahiro Suzuki
 
【DL輪読会】Contrastive Learning as Goal-Conditioned Reinforcement Learning
【DL輪読会】Contrastive Learning as Goal-Conditioned Reinforcement Learning【DL輪読会】Contrastive Learning as Goal-Conditioned Reinforcement Learning
【DL輪読会】Contrastive Learning as Goal-Conditioned Reinforcement Learning
Deep Learning JP
 
「世界モデル」と関連研究について
「世界モデル」と関連研究について「世界モデル」と関連研究について
「世界モデル」と関連研究について
Masahiro Suzuki
 
[DL輪読会]相互情報量最大化による表現学習
[DL輪読会]相互情報量最大化による表現学習[DL輪読会]相互情報量最大化による表現学習
[DL輪読会]相互情報量最大化による表現学習
Deep Learning JP
 
[DL輪読会]`強化学習のための状態表現学習 -より良い「世界モデル」の獲得に向けて-
[DL輪読会]`強化学習のための状態表現学習 -より良い「世界モデル」の獲得に向けて-[DL輪読会]`強化学習のための状態表現学習 -より良い「世界モデル」の獲得に向けて-
[DL輪読会]`強化学習のための状態表現学習 -より良い「世界モデル」の獲得に向けて-
Deep Learning JP
 
強化学習その3
強化学習その3強化学習その3
強化学習その3
nishio
 
強化学習エージェントの内発的動機付けによる探索とその応用(第4回 統計・機械学習若手シンポジウム 招待公演)
強化学習エージェントの内発的動機付けによる探索とその応用(第4回 統計・機械学習若手シンポジウム 招待公演)強化学習エージェントの内発的動機付けによる探索とその応用(第4回 統計・機械学習若手シンポジウム 招待公演)
強化学習エージェントの内発的動機付けによる探索とその応用(第4回 統計・機械学習若手シンポジウム 招待公演)
Shota Imai
 
[DL輪読会]近年のエネルギーベースモデルの進展
[DL輪読会]近年のエネルギーベースモデルの進展[DL輪読会]近年のエネルギーベースモデルの進展
[DL輪読会]近年のエネルギーベースモデルの進展
Deep Learning JP
 
強化学習と逆強化学習を組み合わせた模倣学習
強化学習と逆強化学習を組み合わせた模倣学習強化学習と逆強化学習を組み合わせた模倣学習
強化学習と逆強化学習を組み合わせた模倣学習
Eiji Uchibe
 
【DL輪読会】論文解説:Offline Reinforcement Learning as One Big Sequence Modeling Problem
【DL輪読会】論文解説:Offline Reinforcement Learning as One Big Sequence Modeling Problem【DL輪読会】論文解説:Offline Reinforcement Learning as One Big Sequence Modeling Problem
【DL輪読会】論文解説:Offline Reinforcement Learning as One Big Sequence Modeling Problem
Deep Learning JP
 
深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜
深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜
深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜
Jun Okumura
 
報酬設計と逆強化学習
報酬設計と逆強化学習報酬設計と逆強化学習
報酬設計と逆強化学習
Yusuke Nakata
 
Variational AutoEncoder
Variational AutoEncoderVariational AutoEncoder
Variational AutoEncoder
Kazuki Nitta
 
SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法
SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法
SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法
SSII
 
【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展
【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展
【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展
Deep Learning JP
 
【DL輪読会】DayDreamer: World Models for Physical Robot Learning
【DL輪読会】DayDreamer: World Models for Physical Robot Learning【DL輪読会】DayDreamer: World Models for Physical Robot Learning
【DL輪読会】DayDreamer: World Models for Physical Robot Learning
Deep Learning JP
 

What's hot (20)

[DL輪読会]Learning Latent Dynamics for Planning from Pixels
[DL輪読会]Learning Latent Dynamics for Planning from Pixels[DL輪読会]Learning Latent Dynamics for Planning from Pixels
[DL輪読会]Learning Latent Dynamics for Planning from Pixels
 
[DL輪読会]Control as Inferenceと発展
[DL輪読会]Control as Inferenceと発展[DL輪読会]Control as Inferenceと発展
[DL輪読会]Control as Inferenceと発展
 
[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...
[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...
[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...
 
強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)
強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)
強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)
 
深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)
 
【DL輪読会】Contrastive Learning as Goal-Conditioned Reinforcement Learning
【DL輪読会】Contrastive Learning as Goal-Conditioned Reinforcement Learning【DL輪読会】Contrastive Learning as Goal-Conditioned Reinforcement Learning
【DL輪読会】Contrastive Learning as Goal-Conditioned Reinforcement Learning
 
「世界モデル」と関連研究について
「世界モデル」と関連研究について「世界モデル」と関連研究について
「世界モデル」と関連研究について
 
[DL輪読会]相互情報量最大化による表現学習
[DL輪読会]相互情報量最大化による表現学習[DL輪読会]相互情報量最大化による表現学習
[DL輪読会]相互情報量最大化による表現学習
 
[DL輪読会]`強化学習のための状態表現学習 -より良い「世界モデル」の獲得に向けて-
[DL輪読会]`強化学習のための状態表現学習 -より良い「世界モデル」の獲得に向けて-[DL輪読会]`強化学習のための状態表現学習 -より良い「世界モデル」の獲得に向けて-
[DL輪読会]`強化学習のための状態表現学習 -より良い「世界モデル」の獲得に向けて-
 
強化学習その3
強化学習その3強化学習その3
強化学習その3
 
強化学習エージェントの内発的動機付けによる探索とその応用(第4回 統計・機械学習若手シンポジウム 招待公演)
強化学習エージェントの内発的動機付けによる探索とその応用(第4回 統計・機械学習若手シンポジウム 招待公演)強化学習エージェントの内発的動機付けによる探索とその応用(第4回 統計・機械学習若手シンポジウム 招待公演)
強化学習エージェントの内発的動機付けによる探索とその応用(第4回 統計・機械学習若手シンポジウム 招待公演)
 
[DL輪読会]近年のエネルギーベースモデルの進展
[DL輪読会]近年のエネルギーベースモデルの進展[DL輪読会]近年のエネルギーベースモデルの進展
[DL輪読会]近年のエネルギーベースモデルの進展
 
強化学習と逆強化学習を組み合わせた模倣学習
強化学習と逆強化学習を組み合わせた模倣学習強化学習と逆強化学習を組み合わせた模倣学習
強化学習と逆強化学習を組み合わせた模倣学習
 
【DL輪読会】論文解説:Offline Reinforcement Learning as One Big Sequence Modeling Problem
【DL輪読会】論文解説:Offline Reinforcement Learning as One Big Sequence Modeling Problem【DL輪読会】論文解説:Offline Reinforcement Learning as One Big Sequence Modeling Problem
【DL輪読会】論文解説:Offline Reinforcement Learning as One Big Sequence Modeling Problem
 
深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜
深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜
深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜
 
報酬設計と逆強化学習
報酬設計と逆強化学習報酬設計と逆強化学習
報酬設計と逆強化学習
 
Variational AutoEncoder
Variational AutoEncoderVariational AutoEncoder
Variational AutoEncoder
 
SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法
SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法
SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法
 
【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展
【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展
【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展
 
【DL輪読会】DayDreamer: World Models for Physical Robot Learning
【DL輪読会】DayDreamer: World Models for Physical Robot Learning【DL輪読会】DayDreamer: World Models for Physical Robot Learning
【DL輪読会】DayDreamer: World Models for Physical Robot Learning
 

Similar to [DL輪読会]近年のオフライン強化学習のまとめ —Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems—

[DL輪読会]Hindsight Experience Replayを応用した再ラベリングによる効率的な強化学習
[DL輪読会]Hindsight Experience Replayを応用した再ラベリングによる効率的な強化学習[DL輪読会]Hindsight Experience Replayを応用した再ラベリングによる効率的な強化学習
[DL輪読会]Hindsight Experience Replayを応用した再ラベリングによる効率的な強化学習
Deep Learning JP
 
[DL輪読会]Recent Advances in Autoencoder-Based Representation Learning
[DL輪読会]Recent Advances in Autoencoder-Based Representation Learning[DL輪読会]Recent Advances in Autoencoder-Based Representation Learning
[DL輪読会]Recent Advances in Autoencoder-Based Representation Learning
Deep Learning JP
 
Efficient Scalar Multiplication for Ate Based Pairing over KSS Curve of Embed...
Efficient Scalar Multiplication for Ate Based Pairing over KSS Curve of Embed...Efficient Scalar Multiplication for Ate Based Pairing over KSS Curve of Embed...
Efficient Scalar Multiplication for Ate Based Pairing over KSS Curve of Embed...
Md. Al-Amin Khandaker Nipu
 
Ali-Krakow-2015
Ali-Krakow-2015Ali-Krakow-2015
Ali-Krakow-2015Ahmed Ali
 
Computing the Nucleon Spin from Lattice QCD
Computing the Nucleon Spin from Lattice QCDComputing the Nucleon Spin from Lattice QCD
Computing the Nucleon Spin from Lattice QCD
Christos Kallidonis
 
All Pairs-Shortest Path (Fast Floyd-Warshall) Code
All Pairs-Shortest Path (Fast Floyd-Warshall) Code All Pairs-Shortest Path (Fast Floyd-Warshall) Code
All Pairs-Shortest Path (Fast Floyd-Warshall) Code
Ehsan Sharifi
 
Dynamic programming
Dynamic programmingDynamic programming
Dynamic programming
Shakil Ahmed
 
module4_dynamic programming_2022.pdf
module4_dynamic programming_2022.pdfmodule4_dynamic programming_2022.pdf
module4_dynamic programming_2022.pdf
Shiwani Gupta
 
Nearly optimal average case complexity of counting bicliques under seth
Nearly optimal average case complexity of counting bicliques under sethNearly optimal average case complexity of counting bicliques under seth
Nearly optimal average case complexity of counting bicliques under seth
Nobutaka Shimizu
 
Factorized Asymptotic Bayesian Inference for Latent Feature Models
Factorized Asymptotic Bayesian Inference for Latent Feature ModelsFactorized Asymptotic Bayesian Inference for Latent Feature Models
Factorized Asymptotic Bayesian Inference for Latent Feature Models
Kohei Hayashi
 
Quantum factorization.pdf
Quantum factorization.pdfQuantum factorization.pdf
Quantum factorization.pdf
ssuser8b461f
 
Hierarchical Reinforcement Learning with Option-Critic Architecture
Hierarchical Reinforcement Learning with Option-Critic ArchitectureHierarchical Reinforcement Learning with Option-Critic Architecture
Hierarchical Reinforcement Learning with Option-Critic Architecture
Necip Oguz Serbetci
 
【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood
【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood
【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood
Deep Learning JP
 
TRICK
TRICKTRICK
TRICK
Jimmy Ngu
 
Using blurred images to assess damage in bridge structures?
Using blurred images to assess damage in bridge structures?Using blurred images to assess damage in bridge structures?
Using blurred images to assess damage in bridge structures?
Alessandro Palmeri
 
All You Need is Fold
All You Need is FoldAll You Need is Fold
All You Need is Fold
Mike Harris
 
Big datacourse
Big datacourseBig datacourse
Big datacourse
Massimiliano Ruocco
 
OpenOpt の線形計画で圧縮センシング
OpenOpt の線形計画で圧縮センシングOpenOpt の線形計画で圧縮センシング
OpenOpt の線形計画で圧縮センシング
Toshihiro Kamishima
 
Hiroaki Shiokawa
Hiroaki ShiokawaHiroaki Shiokawa
Hiroaki Shiokawa
Suurist
 
Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.
Vissarion Fisikopoulos
 

Similar to [DL輪読会]近年のオフライン強化学習のまとめ —Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems— (20)

[DL輪読会]Hindsight Experience Replayを応用した再ラベリングによる効率的な強化学習
[DL輪読会]Hindsight Experience Replayを応用した再ラベリングによる効率的な強化学習[DL輪読会]Hindsight Experience Replayを応用した再ラベリングによる効率的な強化学習
[DL輪読会]Hindsight Experience Replayを応用した再ラベリングによる効率的な強化学習
 
[DL輪読会]Recent Advances in Autoencoder-Based Representation Learning
[DL輪読会]Recent Advances in Autoencoder-Based Representation Learning[DL輪読会]Recent Advances in Autoencoder-Based Representation Learning
[DL輪読会]Recent Advances in Autoencoder-Based Representation Learning
 
Efficient Scalar Multiplication for Ate Based Pairing over KSS Curve of Embed...
Efficient Scalar Multiplication for Ate Based Pairing over KSS Curve of Embed...Efficient Scalar Multiplication for Ate Based Pairing over KSS Curve of Embed...
Efficient Scalar Multiplication for Ate Based Pairing over KSS Curve of Embed...
 
Ali-Krakow-2015
Ali-Krakow-2015Ali-Krakow-2015
Ali-Krakow-2015
 
Computing the Nucleon Spin from Lattice QCD
Computing the Nucleon Spin from Lattice QCDComputing the Nucleon Spin from Lattice QCD
Computing the Nucleon Spin from Lattice QCD
 
All Pairs-Shortest Path (Fast Floyd-Warshall) Code
All Pairs-Shortest Path (Fast Floyd-Warshall) Code All Pairs-Shortest Path (Fast Floyd-Warshall) Code
All Pairs-Shortest Path (Fast Floyd-Warshall) Code
 
Dynamic programming
Dynamic programmingDynamic programming
Dynamic programming
 
module4_dynamic programming_2022.pdf
module4_dynamic programming_2022.pdfmodule4_dynamic programming_2022.pdf
module4_dynamic programming_2022.pdf
 
Nearly optimal average case complexity of counting bicliques under seth
Nearly optimal average case complexity of counting bicliques under sethNearly optimal average case complexity of counting bicliques under seth
Nearly optimal average case complexity of counting bicliques under seth
 
Factorized Asymptotic Bayesian Inference for Latent Feature Models
Factorized Asymptotic Bayesian Inference for Latent Feature ModelsFactorized Asymptotic Bayesian Inference for Latent Feature Models
Factorized Asymptotic Bayesian Inference for Latent Feature Models
 
Quantum factorization.pdf
Quantum factorization.pdfQuantum factorization.pdf
Quantum factorization.pdf
 
Hierarchical Reinforcement Learning with Option-Critic Architecture
Hierarchical Reinforcement Learning with Option-Critic ArchitectureHierarchical Reinforcement Learning with Option-Critic Architecture
Hierarchical Reinforcement Learning with Option-Critic Architecture
 
【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood
【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood
【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood
 
TRICK
TRICKTRICK
TRICK
 
Using blurred images to assess damage in bridge structures?
Using blurred images to assess damage in bridge structures?Using blurred images to assess damage in bridge structures?
Using blurred images to assess damage in bridge structures?
 
All You Need is Fold
All You Need is FoldAll You Need is Fold
All You Need is Fold
 
Big datacourse
Big datacourseBig datacourse
Big datacourse
 
OpenOpt の線形計画で圧縮センシング
OpenOpt の線形計画で圧縮センシングOpenOpt の線形計画で圧縮センシング
OpenOpt の線形計画で圧縮センシング
 
Hiroaki Shiokawa
Hiroaki ShiokawaHiroaki Shiokawa
Hiroaki Shiokawa
 
Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.
 

More from Deep Learning JP

【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
Deep Learning JP
 
【DL輪読会】事前学習用データセットについて
【DL輪読会】事前学習用データセットについて【DL輪読会】事前学習用データセットについて
【DL輪読会】事前学習用データセットについて
Deep Learning JP
 
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
Deep Learning JP
 
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
Deep Learning JP
 
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
Deep Learning JP
 
【DL輪読会】マルチモーダル LLM
【DL輪読会】マルチモーダル LLM【DL輪読会】マルチモーダル LLM
【DL輪読会】マルチモーダル LLM
Deep Learning JP
 
【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
 【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo... 【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
Deep Learning JP
 
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
Deep Learning JP
 
【DL輪読会】Can Neural Network Memorization Be Localized?
【DL輪読会】Can Neural Network Memorization Be Localized?【DL輪読会】Can Neural Network Memorization Be Localized?
【DL輪読会】Can Neural Network Memorization Be Localized?
Deep Learning JP
 
【DL輪読会】Hopfield network 関連研究について
【DL輪読会】Hopfield network 関連研究について【DL輪読会】Hopfield network 関連研究について
【DL輪読会】Hopfield network 関連研究について
Deep Learning JP
 
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
Deep Learning JP
 
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
Deep Learning JP
 
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
Deep Learning JP
 
【DL輪読会】"Language Instructed Reinforcement Learning for Human-AI Coordination "
【DL輪読会】"Language Instructed Reinforcement Learning  for Human-AI Coordination "【DL輪読会】"Language Instructed Reinforcement Learning  for Human-AI Coordination "
【DL輪読会】"Language Instructed Reinforcement Learning for Human-AI Coordination "
Deep Learning JP
 
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
Deep Learning JP
 
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
Deep Learning JP
 
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
Deep Learning JP
 
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
Deep Learning JP
 
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
Deep Learning JP
 
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
Deep Learning JP
 

More from Deep Learning JP (20)

【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
 
【DL輪読会】事前学習用データセットについて
【DL輪読会】事前学習用データセットについて【DL輪読会】事前学習用データセットについて
【DL輪読会】事前学習用データセットについて
 
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
 
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
 
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
 
【DL輪読会】マルチモーダル LLM
【DL輪読会】マルチモーダル LLM【DL輪読会】マルチモーダル LLM
【DL輪読会】マルチモーダル LLM
 
【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
 【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo... 【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
 
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
 
【DL輪読会】Can Neural Network Memorization Be Localized?
【DL輪読会】Can Neural Network Memorization Be Localized?【DL輪読会】Can Neural Network Memorization Be Localized?
【DL輪読会】Can Neural Network Memorization Be Localized?
 
【DL輪読会】Hopfield network 関連研究について
【DL輪読会】Hopfield network 関連研究について【DL輪読会】Hopfield network 関連研究について
【DL輪読会】Hopfield network 関連研究について
 
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
 
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
 
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
 
【DL輪読会】"Language Instructed Reinforcement Learning for Human-AI Coordination "
【DL輪読会】"Language Instructed Reinforcement Learning  for Human-AI Coordination "【DL輪読会】"Language Instructed Reinforcement Learning  for Human-AI Coordination "
【DL輪読会】"Language Instructed Reinforcement Learning for Human-AI Coordination "
 
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
 
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
 
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
 
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
 
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
 
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
 

Recently uploaded

Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 

Recently uploaded (20)

Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 

[DL輪読会]近年のオフライン強化学習のまとめ —Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems—

  • 1. 1 —Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems— 2020.06.26 Presenter: Tatsuya Matsushima @__tmats__ , Matsuo Lab
  • 2. RL • 
 • • … • off-policy RL 
 but distributional shift • 2
  • 3. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems • Sergey Levine, Aviral Kumar, George Tucker, Justin Fu • UC Google Brain • https://arxiv.org/abs/2005.01643 (Submitted on 4 May 2020) • Sergey Levine • Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review • https://arxiv.org/abs/1805.00909 • RL • • • RLArch 3
  • 4. 1. 2. RL 3. RL • 4. DP RL 5. RL • 6. 7. ※ * 4
  • 6. MDP • • • • • • • POMDP 
 • Horizon • • ℳ = ( 𝒮, 𝒜, T, d0, r, γ) s ∈ 𝒮 a ∈ 𝒜 T(st+1 |st, at) d0(s0) r : 𝒮 × 𝒜 → ℝ γ ∈ (0,1] π(at |st) τ = (s0, a0, ⋯, sH, aH) H pπ(τ) = d0(s0) H ∏ t=0 π(at |st)T(st+1 |st, at) J(π) = 𝔼τ∼pπ(τ) [∑ H t=0 γt r (st, at)] t s dt(st) s d(s) dt(st) 6
  • 7. REINFORCE • πθ(at |st) ∇θJ (πθ) = 𝔼τ∼pπθ (τ) H ∑ t=0 γt ∇θlog πθ (at ∣ st) ( H ∑ t′=t γt′−t r (st′, at′) − b (st) ) = H ∑ t=0 𝔼st∼dπ t (st),at∼πθ(at ∣ st) [γt ∇θlog πθ (at ∣ st) ̂A (st, at)] 7 ̂A(st, at)
  • 8. DP Q-learning • Q Q 
 • Q-learning • • Vπ Qπ Vπ (st) = 𝔼τ∼pπ(τ ∣ st) [ H ∑ t′=t γt′−t r (st, at) ] Qπ (st, at) = 𝔼τ∼pπ(τ ∣ st, at) [ H ∑ t′=t γt′−t r (st, at) ] Qπ (st, at) = r (st, at) + γ𝔼st+1∼T(st+1 ∣ st, at),at+1∼π(at+1 ∣ st+1) [Qπ (st+1, at+1)] ℬπ ⃗Qπ = ℬπ ⃗Qπ Q⋆ (st, at) = r (st, at) + γ𝔼st+1∼T(st+1 ∣ st, at) [maxat+1 Q⋆ (st+1, at+1)] ℬ* ⃗Qπ = ℬ* ⃗Qπ 8
  • 9. Actor-Critic • 
 Q • 2 • Q • Q 
 πθ(at |st) Qπ (st, at) = r (st, at) + γ𝔼st+1∼T(st+1 ∣ st, at),at+1∼πθ(at+1 ∣ st+1) [Qπ (st+1, at+1)] 9
  • 10. On-policy vs Off-policy On-policy • • SARSA Off-policy • 
 • Q-learning experience replay 10
  • 12. RL • , • 𝒟 = {(si t, ai t, si t+1, ri t)} (s, a) ∈ 𝒟 s ∼ dπβ(s) a ∼ πβ(a|s) πβ 12
  • 13. [Lange+12] RL • batch RL RL • RL DQN • RL • pure-batch RL 
 • fully off-policy • fully • RL • 13
  • 14. RL • exploration • e bot • exploration 
 • 
 • QT-Opt[Kalashnikov+18] 14
  • 15. • 
 • • but open problem • counterfactual • • distributional shift • 
 𝒟 𝒟 15
  • 17. off-policy off-policy policy evaluation, OPE • [Thomas+15] • • • unbiased • self-normalize biased πθ J(πθ) J (πθ) = 𝔼τ∼πβ(τ) [ πθ(τ) πβ(τ) H ∑ t=0 γt r(s, a) ] = 𝔼τ∼πβ(τ) ( H ∏ t=0 πθ (at ∣ st) πβ (at ∣ st)) H ∑ t=0 γt r(s, a) ≈ n ∑ i=1 wi H H ∑ t=0 γt ri t wi t = H ∏ t=0 πθ (at ∣ st) πβ (at ∣ st) {(si t, ai t, ri t, si t+1, ⋯)} n i=1 πβ n n ∑ i=1 wi H 17
  • 18. off-policy off-policy policy evaluation, OPE • • • • self-normalization • Doubly robust estimator • • Q • or unbiased J (πθ) = 𝔼τ∼πβ(τ) [ ∑ H t=0 ( ∏ t t′=0 πθ(at ∣ st) πβ(at ∣ st) ) γt r(s, a) ] ≈ 1 n ∑ n i=1 ∑ H t=0 wi tγt ri t rt t′ > t st, at J (πθ) ≈ ∑ n i=1 ∑ H t=0 γt ( wi t (ri t − ̂Qπθ (st, at)) − wi t−1 𝔼a∼πθ(a ∣ st) [ ̂Qπθ (st, a)]) (st, at) ̂Q(st, at) πβ 18
  • 19. off-policy off-policy policy gradient • • • self-normalization πθ ∇θJ(πθ) ∇θJ (πθ) = 𝔼τ∼πβ(τ) [ πθ(τ) πβ(τ) H ∑ t=0 γt ∇θlog πθ (at ∣ st) ̂A (st, at) ] = 𝔼τ∼πβ(τ) ( H ∏ t=0 πθ (at ∣ st) β (at ∣ st) ) H ∑ t=0 γt ∇θlog πθ (at ∣ st) ̂A (st, at) ≈ n ∑ i=1 wi H H ∑ t=0 γt ∇θlog πθ (ai t ∣ si t) ̂A (si t, ai t) {(si t, ai t, ri t, si t+1, ⋯)} n i=1 πβ n 19
  • 20. off-policy off-policy policy gradient • • • • Doubly robust estimator • off-policy • Softmax • • ∇θJ (πθ) ≈ ∑ n i=1 ∑ H t=0 wi tγt ( ∑ H t′=t γt′−t wi t′ wi t rt′ − b (si t)) ∇θlog πθ (ai t ∣ si t) ∇θ ¯J (πθ) ≈ (∑ n i=1 wi H ∑ H t=0 γt ∇θlog πθ (ai t ∣ si t) ̂A (si t, ai t)) + λ log (∑ n i=1 wi H) 20
  • 21. off-policy • • biased sub-optimal [Imani+18] 
 [Fu+19] • 
 • off-policy • • off-policy • DRL DPG [Silver+14] IPG[Gu+17b] dπ (s) dπβ(s) Jβ(πθ) = 𝔼s∼dπβ(s) [Vπ (s)] J(πθ) dπβ(s) 𝒟 ∇θJβ (πθ) = 𝔼s∼dβ(s),a∼πθ(a∣s) [Qπθ(s, a)∇θlog πθ(a ∣ s) + ∇θQπθ(s, a)] ≈ 𝔼s∼dβ(s),a∼πθ(a∣s) [Qπθ(s, a)∇θlog πθ(a ∣ s)] Qπθ(s, a) 21
• 22. Marginalized importance sampling
• Per-step importance weights multiply over the horizon; instead, estimate the state-marginal density ratio ρ^πθ(s) = d^πθ(s)/d^β(s) directly and reweight the data with it, avoiding both the high variance of trajectory-level weights and the bias of simply ignoring the state distribution.
• ρ^πθ(s) can be characterized as the fixed point of a Forward Bellman Equation or a Backward Bellman Equation on the state marginals.
• Representative methods: DualDICE [Nachum+19a], AlgaeDICE [Nachum+19b], and the unifying Fenchel-Rockafellar duality view [Nachum+20] (RLArch). 22
• 23. Marginalized importance sampling: Forward Bellman Equation
• The ratio ρ^π(s) is the fixed point of a forward Bellman equation on the discounted state marginal: for all s′,
d^πβ(s′) ρ^π(s′) = (1 − γ) d0(s′) + γ ∑_{s,a} d^πβ(s) ρ^π(s) π(a∣s) T(s′∣s, a),
i.e., (d^πβ ∘ ρ^π)(s′) = (ℬ^π ∘ ρ^π)(s′).
• [Gelada+19] estimate ρ^π with a TD-style update on dataset transitions (s ∼ d^πβ, a ∼ πβ(a∣s), s′ ∼ T(s′∣s, a)):
ρ̂^π(s′) ← ρ̂^π(s′) + α[(1 − γ) + γ (π(a∣s)/πβ(a∣s)) ρ̂^π(s) − ρ̂^π(s′)]
(a tabular sketch of this update follows). 23
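A minimal tabular sketch of this TD-style update, assuming a small discrete MDP with uniform target and behavior policies and a synthetic batch of transitions standing in for 𝒟; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
gamma, alpha = 0.9, 0.05

pi = np.full((n_states, n_actions), 1.0 / n_actions)       # target policy pi(a|s)
pi_beta = np.full((n_states, n_actions), 1.0 / n_actions)   # behavior policy pi_beta(a|s)
dataset = [(int(rng.integers(n_states)), int(rng.integers(n_actions)),
            int(rng.integers(n_states))) for _ in range(5000)]  # (s, a, s') standing in for D

rho = np.ones(n_states)  # estimate of rho^pi(s) = d^pi(s) / d^beta(s)
for s, a, s_next in dataset:
    w = pi[s, a] / pi_beta[s, a]                   # importance ratio pi(a|s) / pi_beta(a|s)
    target = (1.0 - gamma) + gamma * w * rho[s]    # bootstrapped target from the forward equation
    rho[s_next] += alpha * (target - rho[s_next])  # TD-style correction at the successor state
```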
• 24. Marginalized importance sampling: minimax estimation
• [Liu+18] turn the forward Bellman equation into a minimax problem, min_ρ max_f L(ρ, f)², where the test function f measures how far ρ is from satisfying the equation:
L(ρ, f) = γ 𝔼_{s,a,s′∼𝒟}[(ρ(s) π(a∣s)/πβ(a∣s) − ρ(s′)) f(s′)] + (1 − γ) 𝔼_{s0∼d0}[(1 − ρ(s0)) f(s0)];
the behavior policy πβ must be known (or estimated) to form the ratio under 𝒟 (a Monte-Carlo sketch of L(ρ, f) follows this slide).
• Related estimators: [Mousavi+20, Kallus+19, Tang+19, Uehara+19].
• GenDICE [Zhang+20] generalizes this to an f-divergence between the two sides of the Bellman equation with a normalization constraint:
min_{ρ^π} D_f((ℬ^π ∘ ρ^π)(s, a), (d^πβ ∘ ρ^π)(s, a))  s.t.  𝔼_{s,a,s′∼𝒟}[ρ^π(s, a)] = 1. 24
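A minimal sketch of a Monte-Carlo estimate of L(ρ, f) as written above, assuming tabular states, fixed candidate ρ and f vectors, and synthetic batches standing in for 𝒟 and d0; names are illustrative and the optimization over ρ and f is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

pi = np.full((n_states, n_actions), 1.0 / n_actions)       # target policy
pi_beta = np.full((n_states, n_actions), 1.0 / n_actions)   # behavior policy
rho = np.ones(n_states)                                     # candidate ratio rho(s)
f = rng.normal(size=n_states)                               # test function f(s)

batch = [(int(rng.integers(n_states)), int(rng.integers(n_actions)),
          int(rng.integers(n_states))) for _ in range(2000)]  # (s, a, s') ~ D
s0_batch = rng.integers(n_states, size=2000)                   # s0 ~ d0

def L(rho, f):
    bellman_term = np.mean([
        (rho[s] * pi[s, a] / pi_beta[s, a] - rho[sp]) * f[sp] for s, a, sp in batch
    ])
    init_term = np.mean((1.0 - rho[s0_batch]) * f[s0_batch])
    return gamma * bellman_term + (1.0 - gamma) * init_term

loss = L(rho, f) ** 2  # [Liu+18] solve min_rho max_f L(rho, f)^2
```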
• 25. Limitations of off-policy estimators for offline RL
• These estimators are reliable only while the learned policy πθ stays close to the behavior policy πβ; as πθ moves away from πβ, the importance ratios (and hence the estimates) degrade rapidly.
• Offline RL typically aims to improve substantially over πβ, which is exactly the regime in which these corrections are least trustworthy. 25
• 26. Offline RL via dynamic programming
• Dynamic-programming (DP) methods such as Q-learning and actor-critic are off-policy by construction and need neither per-step importance weights nor state-marginal corrections.
• In the offline setting, however, they evaluate the Q-function on out-of-distribution (OoD) actions, so naive DP still suffers from distributional shift. 26
• 28. Distributional shift in offline DP
• The data covers d^πβ(s) while the learned policy induces d^π(s), and the Bellman backup queries Q at actions a ∼ π(a∣s) (for Q-learning, a = arg max_a Q(s, a)) that may never appear in 𝒟.
• Errors of the Q-function at such actions are amplified by the maximization, so values are systematically overestimated; in online RL the resulting mistakes are corrected by new data, which is impossible offline.
• See also [Agarwal+19] for a more optimistic perspective when the dataset is very large and diverse (a small numerical illustration of the overestimation follows). 28
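The following toy numerical illustration (not taken from the paper) shows how the greedy backup exploits estimation error on actions missing from the dataset; the setup, with one state and ten equally good actions of which only three are covered, is entirely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 10_000
true_q = np.zeros(n_actions)   # every action is equally good
covered = np.arange(3)         # actions actually present in D

overestimates = []
for _ in range(n_trials):
    q_hat = rng.normal(scale=1.0, size=n_actions)  # large error on uncovered (OoD) actions
    q_hat[covered] = true_q[covered] + rng.normal(scale=0.1, size=len(covered))
    overestimates.append(q_hat.max() - true_q.max())  # value of the greedy action vs truth

print(f"mean overestimation of max_a Q(s,a): {np.mean(overestimates):.2f}")
# The greedy backup picks whichever out-of-distribution action happens to have the
# largest positive error, so the estimated value is biased upward.
```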
• 29. Two families of remedies
• policy constraint: keep the learned policy π close to the behavior policy πβ so that the backup only queries in-distribution actions.
• uncertainty-based: estimate the epistemic uncertainty of the Q-function and act conservatively where it is large. 29
• 30. Policy-constraint methods
• Constrain the actor, and the actions a′ ∼ π(a′∣s′) used in the Q target, to stay close to πβ(a′∣s′) so that the Q-function is never evaluated on OoD actions.
• The constraint can be an explicit f-divergence, an implicit f-divergence, or an integral probability metric (IPM). 30
• 31. Policy constraints: constrained and penalized formulations
• Constrained actor-critic: alternate a standard Bellman fit for Q with a constrained actor step,
Q̂^π_{k+1} ← arg min_Q 𝔼_{(s,a,s′)∼𝒟}[(Q(s, a) − (r(s, a) + γ 𝔼_{a′∼π_k(a′∣s′)}[Q̂^π_k(s′, a′)]))²]
π_{k+1} ← arg max_π 𝔼_{s∼𝒟}[𝔼_{a∼π(a∣s)}[Q̂^π_{k+1}(s, a)]]  s.t.  D(π, πβ) ≤ ε
• Penalized version: subtract the divergence from both the Bellman target and the actor objective,
Q̂^π_{k+1} ← arg min_Q 𝔼_{(s,a,s′)∼𝒟}[(Q(s, a) − (r(s, a) + γ 𝔼_{a′∼π_k(a′∣s′)}[Q̂^π_k(s′, a′)] − αγ D(π_k(⋅∣s′), πβ(⋅∣s′))))²]
π_{k+1} ← arg max_π 𝔼_{s∼𝒟}[𝔼_{a∼π(a∣s)}[Q̂^π_{k+1}(s, a) − α D(π(⋅∣s), πβ(⋅∣s))]]
(a tabular sketch of the penalized update follows). 31
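A minimal tabular sketch of the penalized formulation, assuming discrete states and actions, a known behavior policy, and KL divergence as D; it illustrates the equations above rather than any particular published algorithm, and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
gamma, alpha, lr, n_sweeps = 0.9, 0.5, 0.1, 50

pi_beta = rng.dirichlet(np.ones(n_actions), size=n_states)  # behavior policy pi_beta(a|s)
pi_k = pi_beta.copy()                                        # current policy iterate pi_k
q = np.zeros((n_states, n_actions))
dataset = [(int(rng.integers(n_states)), int(rng.integers(n_actions)),
            float(rng.normal()), int(rng.integers(n_states)))
           for _ in range(2000)]                             # (s, a, r, s') standing in for D

def kl(p, q_):
    """KL(p || q) between two categorical distributions."""
    return float(np.sum(p * (np.log(p + 1e-8) - np.log(q_ + 1e-8))))

# Critic: regress Q toward the penalized Bellman target.
for _ in range(n_sweeps):
    for s, a, r, s_next in dataset:
        penalty = alpha * gamma * kl(pi_k[s_next], pi_beta[s_next])
        target = r + gamma * float(np.dot(pi_k[s_next], q[s_next])) - penalty
        q[s, a] += lr * (target - q[s, a])

# Actor: with D = KL(pi || pi_beta), maximizing E_{a~pi}[Q(s,a)] - alpha*D has the
# closed form pi(a|s) proportional to pi_beta(a|s) * exp(Q(s,a) / alpha).
logits = np.log(pi_beta + 1e-8) + q / alpha
pi_k = np.exp(logits - logits.max(axis=1, keepdims=True))
pi_k /= pi_k.sum(axis=1, keepdims=True)
```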
• 32. Choices of the divergence D
• explicit f-divergence: add the divergence between π and an (estimated) πβ directly to the objective. KL: Way Off-Policy [Jaques+19], BRAC [Wu+19].
• implicit f-divergence: instead of constraining π directly, project a Q-reweighted behavior policy onto the policy class. KL: AWR [Peng+19], ABM [Siegel+20]. With a KL penalty the improved policy has the closed form
π̄_{k+1}(a∣s) ← (1/Z) πβ(a∣s) exp((1/α) Q^π_k(s, a)),  π_{k+1} ← arg min_π D_KL(π̄_{k+1}, π);
since dataset actions are already distributed as πβ(a∣s), weighting them by exp((1/α) Q^π_k(s, a)) implements this projection as weighted regression without modeling πβ(a∣s) explicitly (a sketch follows).
• integral probability metric (IPM): MMD in BEAR [Kumar+19], Wasserstein distance in BRAC [Wu+19]. 32
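A minimal sketch of the implicit (weighted-regression) variant, assuming tabular discrete actions, a Q estimate computed elsewhere, and a synthetic dataset of (s, a) pairs; this is in the spirit of AWR-style methods but is not a reproduction of any specific implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, alpha = 4, 3, 1.0
q_hat = rng.normal(size=(n_states, n_actions))     # stand-in Q^pi_k estimate
dataset = [(int(rng.integers(n_states)), int(rng.integers(n_actions)))
           for _ in range(5000)]                    # (s, a) pairs with a ~ pi_beta(a|s)

# Weighted empirical counts: each dataset action contributes exp(Q(s,a) / alpha).
counts = np.zeros((n_states, n_actions))
for s, a in dataset:
    counts[s, a] += np.exp(q_hat[s, a] / alpha)

pi_new = counts / counts.sum(axis=1, keepdims=True)  # normalize per state
# Because dataset actions are already distributed as pi_beta(a|s), the normalized
# weighted counts are proportional to pi_beta(a|s) * exp(Q(s,a)/alpha), i.e. pi_bar.
```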
• 33. How tight should the constraint be?
• A KL (or other f-divergence) constraint forces π to match πβ as a distribution; when πβ is very stochastic (e.g., close to uniform) or of low quality, this keeps the learned policy from exploiting only its good actions.
• Constraining only the support of π, so that it avoids actions with negligible probability under πβ without having to match its probabilities, is a less restrictive alternative [Kumar+19]. 33
• 34. Uncertainty-based methods
• Learn a distribution 𝒫_𝒟(Q^π) over Q-functions from 𝒟 (capturing epistemic uncertainty) and penalize the uncertainty, giving a conservative estimate for the actor:
π_{k+1} ← arg max_π 𝔼_{s∼𝒟}[𝔼_{a∼π(a∣s)}[𝔼_{Q^π_{k+1}∼𝒫_𝒟(Q^π)}[Q^π_{k+1}(s, a)] − α Unc(𝒫_𝒟(Q^π))]]
• Q-function ensembles are a common instantiation of 𝒫_𝒟, e.g., in BEAR [Kumar+19] and [Agarwal+19] (an ensemble-based sketch with Unc taken as the ensemble standard deviation follows). 34
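A minimal sketch of an uncertainty-penalized (conservative) Q estimate, assuming 𝒫_𝒟(Q) is approximated by a bootstrap ensemble of tabular Q-functions and Unc is taken as the ensemble standard deviation; this is one common choice, not the only possible instantiation of Unc.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_ensemble, alpha = 4, 3, 5, 1.0

# Stand-in for an ensemble of Q-functions trained on bootstrapped resamples of D.
q_ensemble = rng.normal(size=(n_ensemble, n_states, n_actions))

q_mean = q_ensemble.mean(axis=0)         # E_{Q ~ P_D}[Q(s, a)]
q_unc = q_ensemble.std(axis=0)           # Unc(P_D(Q)): std across the ensemble
q_conservative = q_mean - alpha * q_unc  # conservative estimate used by the actor

greedy_policy = q_conservative.argmax(axis=1)  # act greedily w.r.t. the conservative value
```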
• 35. calibration of the uncertainty estimates
• This approach only helps if 𝒫_𝒟 and Unc are well calibrated; in practice it is difficult to make the uncertainty estimates sharp enough to reliably suppress exploitation of OoD actions (cf. [Fujimoto+18a]).
• Obtaining sufficiently reliable uncertainty estimates from 𝒟 alone, without access to πβ or further interaction, remains an open problem. 35
• 37. Model-based offline RL
• Learn a dynamics model Tψ(s_{t+1}∣s_t, a_t) from the dataset by supervised learning and use it for planning with MPC [Tassa+12], backpropagation through the model (BPTT, PILCO [Deisenroth+11]), or generating synthetic rollouts (Dyna [Sutton+91], PIPPS [Parmas+19]).
• The learned model is itself subject to distribution shift: rollouts that leave the data distribution compound model error.
• Work on model-based RL specifically for the offline setting is described in the tutorial as surprisingly limited (a least-squares model-fitting sketch follows). 37
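A minimal sketch of fitting a dynamics model Tψ from an offline dataset, assuming continuous states/actions and a simple linear model s′ ≈ A[s; a] + b fit by least squares; real implementations typically use neural networks and model ensembles, but the supervised-learning structure is the same, and the data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, n = 3, 1, 2000

states = rng.normal(size=(n, state_dim))
actions = rng.normal(size=(n, action_dim))
next_states = states + 0.1 * actions + 0.01 * rng.normal(size=(n, state_dim))  # synthetic D

x = np.hstack([states, actions, np.ones((n, 1))])    # features [s; a; 1]
w, *_ = np.linalg.lstsq(x, next_states, rcond=None)  # least-squares fit of psi = (A, b)

def predict(s, a):
    """One-step prediction s' = T_psi(s, a), usable for planning or synthetic rollouts."""
    return np.hstack([s, a, 1.0]) @ w

pred = predict(states[0], actions[0])
```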
• 38. Recent model-based offline RL (concurrent work around NeurIPS 2020, marked with *)
• MOReL [Kidambi+20*]: constructs a Pessimistic MDP from a learned model ensemble, penalizing visits to state-action regions where the ensemble disagrees.
• MOPO [Yu+20*]: penalizes the model-predicted reward by an estimate of the model's uncertainty and runs standard RL inside the penalized model.
• BREMEN [Matsushima+20*]: behavior-cloning initialization + TRPO-style trust-region updates inside a learned model ensemble; reports offline RL with roughly 5-10% of the data used by prior methods (an uncertainty-penalized-reward sketch in the spirit of MOPO follows). 38
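A minimal sketch of an uncertainty-penalized reward in the spirit of MOPO, assuming an ensemble of one-step models and taking the penalty u(s, a) to be the ensemble's prediction spread; MOPO itself penalizes with the norm of the predicted Gaussian standard deviation, so this is a simplified stand-in, and lam is an illustrative hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, state_dim, lam = 5, 3, 1.0

def ensemble_predict(s, a):
    """Stand-in for an ensemble of learned models: returns n_models next-state means."""
    return np.stack([s + 0.1 * a + 0.05 * rng.normal(size=state_dim)
                     for _ in range(n_models)])

def penalized_reward(s, a, r_model):
    preds = ensemble_predict(s, a)          # (n_models, state_dim)
    u = np.linalg.norm(preds.std(axis=0))   # ensemble disagreement as uncertainty u(s, a)
    return r_model - lam * u                # r_tilde = r - lam * u(s, a)

s, a = rng.normal(size=state_dim), rng.normal()
r_tilde = penalized_reward(s, a, r_model=1.0)
# A policy trained with r_tilde inside the learned model is discouraged from visiting
# regions where the model ensemble disagrees.
```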
• 39. Models vs. value functions for long horizons
• Whether model-based or DP (value-based) methods are preferable for long-horizon MDPs in the offline setting remains an open problem.
• Intermediate approaches predict task-relevant future quantities directly instead of full dynamics: DFP [Dosovitskiy+17], VPN [Oh+17], BADGR [Kahn+20].
• Model-based RL and DP-based RL are also closely related; fitted value iteration can be analyzed in these terms [Vanseijen+15, Parr+08]. 39
  • 40. 7. 40
• 41. Counterfactual queries and generalization
• Offline RL requires answering counterfactual queries: what would happen under actions the behavior policy never took? This breaks the i.i.d. assumption behind standard supervised learning.
• Relevant machine-learning tools include causality [Schölkopf19], uncertainty estimation [Gal+16], deep generative models [Kingma+14], distributional robustness [Sinha+17], and invariance [Arjovsky+19].
• Learned models that generalize broadly, e.g., visual foresight [Finn+17, Ebert+18] and Interaction Networks [Battaglia+16], suggest that similar generalization may be attainable in offline RL. 41
  • 42. 42
• 43. Related topics and resources
• Benchmarks: D4RL [Kumar+20], RL Unplugged [Gulcehre+20*].
• Learning from large offline robot-manipulation datasets: IRIS [Mandlekar+19*].
• Close connections to off-policy RL and to the Control as Inference view of RL.
• Relation to imitation learning: behavior cloning (BC) and AWR [Peng+19]; adversarial approaches such as GAIL [Ho+16*]; learning from ranked demonstrations with T-REX [Brown+19a*] and D-REX [Brown+19b*]. 43
• 44.
[Agarwal+19] Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi. “An Optimistic Perspective on Offline Reinforcement Learning”. https://arxiv.org/abs/1907.04543
[Arjovsky+19] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, David Lopez-Paz. “Invariant Risk Minimization”. https://arxiv.org/abs/1907.02893
[Battaglia+16] Peter W. Battaglia, Razvan Pascanu, Matthew Lai, Danilo Rezende, Koray Kavukcuoglu. “Interaction Networks for Learning about Objects, Relations and Physics”. https://arxiv.org/abs/1612.00222
[Brown+19a*] Daniel S. Brown, Wonjoon Goo, Prabhat Nagarajan, Scott Niekum. “Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations”. https://arxiv.org/abs/1904.06387
[Brown+19b*] Daniel S. Brown, Wonjoon Goo, Scott Niekum. “Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations”. https://arxiv.org/abs/1907.03976
[Deisenroth+11] Marc Deisenroth, Carl E. Rasmussen. “PILCO: A model-based and data-efficient approach to policy search”. Proceedings of the 28th International Conference on Machine Learning. 2011.
[Dosovitskiy+17] Alexey Dosovitskiy, Vladlen Koltun. “Learning to Act by Predicting the Future”. https://arxiv.org/abs/1611.01779
[Ebert+18] Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, Sergey Levine. “Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control”. https://arxiv.org/abs/1812.00568
[Finn+17] Chelsea Finn, Sergey Levine. “Deep Visual Foresight for Planning Robot Motion”. https://arxiv.org/abs/1610.00696
[Fu+19] Justin Fu, Aviral Kumar, Matthew Soh, Sergey Levine. “Diagnosing bottlenecks in deep Q-learning algorithms”. https://arxiv.org/abs/1902.10250
[Fujimoto+18a] Scott Fujimoto, David Meger, Doina Precup. “Off-Policy Deep Reinforcement Learning without Exploration”. https://arxiv.org/abs/1812.02900
[Gal+16] Yarin Gal, Zoubin Ghahramani. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”. https://arxiv.org/abs/1506.02142
[Gelada+19] Carles Gelada, Marc G. Bellemare. “Off-policy deep reinforcement learning by bootstrapping the covariate shift”. https://arxiv.org/abs/1901.09455
[Gu+17] Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard Schölkopf, Sergey Levine. “Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning”. https://arxiv.org/abs/1706.00387
[Gulcehre+20*] Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Tom Le Paine, Sergio Gómez Colmenarejo, Konrad Zolna, Rishabh Agarwal, Josh Merel, Daniel Mankowitz, Cosmin Paduraru, Gabriel Dulac-Arnold, Jerry Li, Mohammad Norouzi, Matt Hoffman, Ofir Nachum, George Tucker, Nicolas Heess, Nando de Freitas. “RL Unplugged: Benchmarks for Offline Reinforcement Learning”. https://arxiv.org/abs/2006.13888
[Ho+16*] Jonathan Ho, Stefano Ermon. “Generative Adversarial Imitation Learning”. https://arxiv.org/abs/1606.03476
[Imani+18] Ehsan Imani, Eric Graves, Martha White. “An Off-policy Policy Gradient Theorem Using Emphatic Weightings”. https://arxiv.org/abs/1811.09013 44
• 45.
[Jaques+19] Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, Rosalind Picard. “Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog”. https://arxiv.org/abs/1907.00456
[Kahn+20] Gregory Kahn, Pieter Abbeel, Sergey Levine. “BADGR: An Autonomous Self-Supervised Learning-Based Navigation System”. https://arxiv.org/abs/2002.05700
[Kalashnikov+18] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, Sergey Levine. “QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation”. https://arxiv.org/abs/1806.10293
[Kallus+19] Nathan Kallus, Masatoshi Uehara. “Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning”. https://arxiv.org/abs/1909.05850
[Kidambi+20*] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, Thorsten Joachims. “MOReL: Model-Based Offline Reinforcement Learning”. https://arxiv.org/abs/2005.05951
[Kingma+14] Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling. “Semi-Supervised Learning with Deep Generative Models”. https://arxiv.org/abs/1406.5298
[Kumar+19] Aviral Kumar, Justin Fu, George Tucker, Sergey Levine. “Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction”. https://arxiv.org/abs/1906.00949
[Kumar+20] “Does on-policy data collection fix errors in reinforcement learning?”. https://bair.berkeley.edu/blog/2020/03/16/discor/
[Lange+12] Sascha Lange, Thomas Gabel, Martin Riedmiller. “Batch reinforcement learning”. Reinforcement Learning. Springer, 2012. 45-73.
[Liu+18] Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou. “Breaking the curse of horizon: Infinite-horizon off-policy estimation”. https://arxiv.org/abs/1810.12429
[Mandlekar+19*] Ajay Mandlekar, Fabio Ramos, Byron Boots, Silvio Savarese, Li Fei-Fei, Animesh Garg, Dieter Fox. “IRIS: Implicit Reinforcement without Interaction at Scale for Learning Control from Offline Robot Manipulation Data”. https://arxiv.org/abs/1911.05321
[Matsushima+20*] Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, Shixiang Gu. “Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization”. https://arxiv.org/abs/2006.03647
[Mousavi+20] Ali Mousavi, Lihong Li, Qiang Liu, Denny Zhou. “Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning”. https://arxiv.org/abs/2003.11126
[Nachum+19a] Ofir Nachum, Yinlam Chow, Bo Dai, Lihong Li. “DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections”. https://arxiv.org/abs/1906.04733
[Nachum+19b] Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, Dale Schuurmans. “AlgaeDICE: Policy Gradient from Arbitrary Experience”. https://arxiv.org/abs/1912.02074
[Nachum+20] Ofir Nachum, Bo Dai. “Reinforcement Learning via Fenchel-Rockafellar Duality”. https://arxiv.org/abs/2001.01866 45
• 46.
[Oh+17] Junhyuk Oh, Satinder Singh, Honglak Lee. “Value Prediction Network”. https://arxiv.org/abs/1707.03497
[Parr+08] Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, Michael L. Littman. “An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning”. In Proceedings of the 25th International Conference on Machine Learning. 2008.
[Parmas+19] Paavo Parmas, Carl Edward Rasmussen, Jan Peters, Kenji Doya. “PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos”. https://arxiv.org/abs/1902.01240
[Peng+19] Xue Bin Peng, Aviral Kumar, Grace Zhang, Sergey Levine. “Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning”. https://arxiv.org/abs/1910.00177
[Schölkopf19] Bernhard Schölkopf. “Causality for machine learning”. https://arxiv.org/abs/1911.10500
[Siegel+20] Noah Y. Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, Martin Riedmiller. “Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning”. https://arxiv.org/abs/2002.08396
[Sinha+17] Aman Sinha, Hongseok Namkoong, Riccardo Volpi, John Duchi. “Certifying Some Distributional Robustness with Principled Adversarial Training”. https://arxiv.org/abs/1710.10571
[Silver+14] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller. “Deterministic policy gradient algorithms”. Proceedings of the 31st International Conference on Machine Learning. 2014.
[Sutton+91] Richard S. Sutton. “Dyna, an integrated architecture for learning, planning, and reacting”. ACM SIGART Bulletin, 2(4), 160-163. 1991.
[Tassa+12] Yuval Tassa, Tom Erez, Emanuel Todorov. “Synthesis and stabilization of complex behaviors through online trajectory optimization”. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. 2012.
[Tang+19] Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, Qiang Liu. “Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation”. https://arxiv.org/abs/1910.07186
[Thomas+15] Philip S. Thomas, Georgios Theocharous, Mohammad Ghavamzadeh. “High-confidence off-policy evaluation”. Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015.
[Uehara+19] Masatoshi Uehara, Jiawei Huang, Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation”. https://arxiv.org/abs/1910.12809
[Vanseijen+15] Harm Vanseijen, Rich Sutton. “A deeper look at planning as learning from replay”. In International Conference on Machine Learning. 2015.
[Wu+19] Yifan Wu, George Tucker, Ofir Nachum. “Behavior Regularized Offline Reinforcement Learning”. https://arxiv.org/abs/1911.11361
[Yu+20*] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, Tengyu Ma. “MOPO: Model-based Offline Policy Optimization”. https://arxiv.org/abs/2005.13239
[Zhang+20] Ruiyi Zhang, Bo Dai, Lihong Li, Dale Schuurmans. “GenDICE: Generalized Offline Estimation of Stationary Values”. https://arxiv.org/abs/2002.09072 46
  • 47. 47