Safe and Efficient Off-Policy
Reinforcement Learning
NIPS 2016
Yasuhiro Fujita
Preferred Networks Inc.
January 19, 2017
Safe and Efficient Off-Policy Reinforcement Learning
by Remi Munos, Thomas Stepleton, Anna Harutyunyan and Marc G. Bellemare
▶ Off-policy RL: learning the value function for one policy π
Qπ(x, a) = Eπ[r1 + γr2 + γ^2 r3 + · · · | x0 = x, a0 = a]
from data collected by another policy µ ≠ π
▶ Retrace(λ): a new off-policy multi-step RL algorithm
▶ Theoretical advantages
+ It converges for any π, µ (safe)
+ It makes the best use of samples if π and µ are close to
each other (efficient)
+ Its variance is lower than importance sampling
▶ Empirical evaluation
▶ On Atari 2600 it beats one-step Q-learning (DQN) and
the existing multi-step methods (Q∗(λ), Tree-Backup)
Notation and definitions
▶ state x ∈ X
▶ action a ∈ A
▶ discount factor γ ∈ [0, 1]
▶ immediate reward r ∈ R
▶ policies π, µ : X × A → [0, 1]
▶ value function Qπ(x, a) := Eπ[r1 + γr2 + γ^2 r3 + · · · | x0 = x, a0 = a]
▶ optimal value function Q∗ := maxπ Qπ
▶ EπQ(x, ·) := ∑a π(a|x) Q(x, a)
Policy evaluation
▶ Learning the value function for a policy π:
Qπ(x, a) = Eπ[r1 + γr2 + γ^2 r3 + · · · | x0 = x, a0 = a]
▶ You can learn optimal control if π is the greedy policy with respect to the current estimate Q(x, a), e.g. Q-learning
▶ On-policy: learning from data collected by π
▶ Off-policy: learning from data collected by µ ̸= π
▶ Off-policy methods have advantages:
+ Sample-efficient (e.g. experience replay)
+ Exploration by µ
On-policy multi-step methods
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ Temporal difference (or “surprise”) at t:
δt = rt + γQ(xt+1, at+1) − Q(xt, at)
▶ You can use δt to estimate Qπ(xt, at) (one-step)
▶ Can you use δt to estimate Qπ(xs, as) for all s ≤ t? (multi-step)
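As a minimal illustration (not from the slides; the tabular setup and names are assumptions), the one-step TD error can be computed directly from a single transition:

```python
import numpy as np

def td_error(Q, x, a, r, x_next, a_next, gamma=0.99):
    """One-step TD error: delta_t = r_t + gamma * Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t)."""
    return r + gamma * Q[x_next, a_next] - Q[x, a]

# Tabular example with 5 states and 2 actions.
Q = np.zeros((5, 2))
delta = td_error(Q, x=0, a=1, r=1.0, x_next=2, a_next=0)  # = 1.0 here
```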
TD(λ)
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ A popular multi-step algorithm for on-policy policy
evaluation
▶ ∆tQ(x, a) = (γλ)^t δt, where λ ∈ [0, 1] is chosen to balance bias and variance
▶ Multi-step methods have advantages:
+ Rewards are propagated rapidly
+ Bias introduced by bootstrapping is reduced
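A rough sketch of the forward-view TD(λ) update implied by the formula above, applied to the pair (x, a) that starts the trajectory; the offline accumulation and variable names are assumptions:

```python
def td_lambda_update(Q, x, a, deltas, gamma=0.99, lam=0.9, alpha=0.1):
    """Apply sum_t (gamma * lambda)^t * delta_t to Q(x, a) (offline forward view)."""
    total = sum((gamma * lam) ** t * delta for t, delta in enumerate(deltas))
    Q[x, a] += alpha * total
```

Here `deltas` holds the TD errors δ0, δ1, … computed along a trajectory that starts in (x, a).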
Off-policy multi-step algorithm
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ δt = rt + γEπQ(xt+1, ·) − Q(xt, at)
▶ You can use δt to estimate Qπ(xt, at), e.g. Q-learning
▶ Can you use δt to estimate Qπ(xs, as) for all s ≤ t?
▶ δt might be less relevant to Qπ(xs, as) compared to the on-policy case
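A short sketch (assumed tabular Q and a target policy stored as a probability table) of the expected-value TD error above, which bootstraps with EπQ(xt+1, ·) instead of the behavior policy's next action:

```python
import numpy as np

def expected_td_error(Q, pi, x, a, r, x_next, gamma=0.99):
    """delta_t = r_t + gamma * E_pi[Q(x_{t+1}, .)] - Q(x_t, a_t)."""
    expected_q = float(np.dot(pi[x_next], Q[x_next]))  # sum_a pi(a|x') Q(x', a)
    return r + gamma * expected_q - Q[x, a]
```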
Importance Sampling (IS) [Precup et al. 2000]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = γ^t (∏1≤s≤t π(as|xs)/µ(as|xs)) δt
+ Unbiased estimate of Qπ
− Large (possibly infinite) variance since π(as|xs)/µ(as|xs) is not bounded
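A small sketch (the per-step probabilities are made-up numbers) of the cumulative importance weights ∏ π/µ; because the per-step ratio is unbounded, the product can explode, which is exactly the variance problem noted above:

```python
import numpy as np

def is_trace(pi_probs, mu_probs):
    """Cumulative importance weights prod_{s<=t} pi(a_s|x_s) / mu(a_s|x_s)."""
    ratios = np.asarray(pi_probs) / np.asarray(mu_probs)
    return np.cumprod(ratios)

# pi puts 0.9 on actions that mu picked with probability 0.1: the weight grows as 9^t.
print(is_trace([0.9] * 5, [0.1] * 5))  # [9., 81., 729., 6561., 59049.]
```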
Qπ(λ) [Harutyunyan et al. 2016]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = (γλ)^t δt
+ Convergent if µ and π are sufficiently close to each other or λ is sufficiently small:
λ < (1 − γ)/(γϵ), where ϵ := maxx ∥π(·|x) − µ(·|x)∥1
− Not convergent otherwise
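For a sense of scale (illustrative numbers, not from the slides): with γ = 0.99 and ϵ = 0.2, the condition becomes λ < (1 − 0.99)/(0.99 · 0.2) ≈ 0.05, so Qπ(λ) is only guaranteed to converge with a very short effective trace unless µ is already close to π.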
Tree-Backup (TB) [Precup et al. 2000]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = (γλ)^t (∏1≤s≤t π(as|xs)) δt
+ Convergent for any π and µ
+ Works even if µ is unknown and/or non-Markov
− ∏1≤s≤t π(as|xs) decays rapidly when near on-policy
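A tiny illustration (made-up probabilities) of the inefficiency noted above: even when µ = π and π is nearly greedy, the Tree-Backup trace ∏ π(as|xs) shrinks geometrically and cuts off the return:

```python
import numpy as np

# A near-greedy pi that puts probability 0.9 on each observed action (and mu = pi).
pi_probs = np.full(10, 0.9)
print(np.cumprod(pi_probs))  # falls to about 0.35 after 10 steps despite being on-policy
```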
A unified view
▶ General algorithm: ∆Q(x, a) = ∑t≥0 γ^t (∏1≤s≤t cs) δt
▶ None of the existing methods is perfect
▶ Low variance (↔ IS)
▶ “Safe” i.e. convergent for any π and µ (↔ Qπ(λ))
▶ “Efficient” i.e. using full returns when on-policy (↔ Tree-Backup)
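The comparison above can be summarized in a few lines of code. The sketch below is a toy, offline tabular version (all names, array conventions and the λ value are assumptions, not the authors' code): it computes the generic update ∑t γ^t (∏1≤s≤t cs) δt for one starting pair and lists the four coefficient choices.

```python
import numpy as np

def general_offpolicy_update(deltas, pi_probs, mu_probs, coeff_fn, gamma=0.99):
    """Generic multi-step update: Delta Q(x, a) = sum_t gamma^t (prod_{1<=s<=t} c_s) delta_t.

    deltas[t]   : off-policy TD error delta_t along the trajectory
    pi_probs[s] : pi(a_s | x_s) for the observed action at step s
    mu_probs[s] : mu(a_s | x_s) for the observed action at step s
    coeff_fn    : maps (pi_prob, mu_prob) -> trace coefficient c_s
    """
    cs = np.array([coeff_fn(p, m) for p, m in zip(pi_probs, mu_probs)])
    # prod_{1<=s<=t} c_s, with the empty product at t = 0 equal to 1 (c_0 is never used).
    trace = np.concatenate(([1.0], np.cumprod(cs[1:])))
    discounts = gamma ** np.arange(len(deltas))
    return float(np.sum(discounts * trace * np.asarray(deltas)))

# The four methods differ only in the choice of c_s (lam stands for lambda):
lam = 0.9
importance_sampling = lambda p, m: p / m                  # IS
q_pi_lambda         = lambda p, m: lam                    # Q^pi(lambda)
tree_backup         = lambda p, m: lam * p                # Tree-Backup(lambda)
retrace             = lambda p, m: lam * min(1.0, p / m)  # Retrace(lambda)
```

Plugging `retrace` in as `coeff_fn` reproduces the Retrace(λ) update shown two slides later.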
Choice of the coefficients cs
▶ Contraction speed
▶ Consider a general operator R:
RQ(x, a) = Q(x, a) + Eµ[∑t≥0 γ^t (∏1≤s≤t cs) δt]
▶ If 0 ≤ cs ≤ π(as|xs)/µ(as|xs), R is a contraction and Qπ is its fixed point (thus the algorithm is “safe”)
|RQ(x, a) − Qπ(x, a)| ≤ η(x, a) ∥Q − Qπ∥
η(x, a) := 1 − (1 − γ) Eµ[∑t≥0 γ^t (∏1≤s≤t cs)]
▶ η = 0 for cs = 1 (“efficient”)
▶ Variance
▶ cs ≤ 1 results in low variance since ∏1≤s≤t cs ≤ 1
Retrace(λ)
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆Q(x, a) = γ^t (∏1≤s≤t λ min(1, π(as|xs)/µ(as|xs))) δt
+ Variance is bounded
+ Convergent for any π and µ
+ Uses full returns when on-policy
− Doesn’t work if µ is unknown or non-Markov (↔ Tree-Backup)
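A self-contained sketch (made-up probabilities) of the Retrace trace coefficients; the min(1, π/µ) clamp is what bounds the variance, while on-policy every ratio equals 1, so only λ shortens the trace and full returns are used. These coefficients are exactly the `retrace` choice in the unified sketch above.

```python
import numpy as np

def retrace_trace(pi_probs, mu_probs, lam=1.0):
    """Cumulative Retrace coefficients prod_{s<=t} lam * min(1, pi(a_s|x_s)/mu(a_s|x_s))."""
    cs = lam * np.minimum(1.0, np.asarray(pi_probs) / np.asarray(mu_probs))
    return np.cumprod(cs)

print(retrace_trace([0.9, 0.8, 0.7], [0.9, 0.8, 0.7]))  # on-policy: [1. 1. 1.] -> full returns
print(retrace_trace([0.9, 0.9, 0.9], [0.1, 0.1, 0.1]))  # ratios of 9 are clipped to 1 -> bounded variance
print(retrace_trace([0.1, 0.1, 0.1], [0.9, 0.9, 0.9]))  # trace is cut when mu's actions are unlikely under pi
```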
Evaluation on Atari 2600
▶ Trained asynchronously with 16 CPU threads [Mnih
et al. 2016]
▶ Each thread has private replay memory holding 62,500
transitions
▶ Q-learning uses a minibatch of 64 transitions
▶ Retrace, TB and Q∗(λ) (a control version of Qπ(λ)) use four 16-step sub-sequences
Performance comparison
▶ Inter-algorithm scores are normalized so that 0 and 1
respectively correspond to the worst and best scores for a
particular game
▶ λ = 1 performs best, except for Q∗(λ)
▶ Retrace(λ) performs best on 30 out of 60 games
Sensitivity to the value of λ
▶ Retrace(λ) is robust and consistently outperforms Tree-Backup
▶ Q∗(λ) performs best for small values of λ
▶ Note that the Q-learning scores are fixed across different λ
Conclusions
▶ Retrace(λ)
▶ is an off-policy multi-step value-based RL algorithm
▶ is low-variance, safe and efficient
▶ outperforms one-step Q-learning and existing multi-step
variants on Atari 2600
▶ (has already been applied to A3C in another paper [Wang et al. 2016])
References I
[1] Anna Harutyunyan et al. “Q(λ) with Off-Policy Corrections”. In: Proceedings
of Algorithmic Learning Theory (ALT). 2016. arXiv: 1602.04951.
[2] Volodymyr Mnih et al. “Asynchronous Methods for Deep Reinforcement
Learning”. 2016. arXiv: 1602.01783.
[3] Doina Precup, Richard S Sutton, and Satinder P Singh. “Eligibility Traces for
Off-Policy Policy Evaluation”. In: ICML ’00: Proceedings of the Seventeenth
International Conference on Machine Learning (2000), pp. 759–766.
[4] Ziyu Wang et al. “Sample Efficient Actor-Critic with Experience Replay”. In:
arXiv (2016), pp. 1–20. arXiv: 1611.01224.
