Hierarchical Reinforcement Learning with
Option-Critic Architecture
Oğuz Şerbetci
April 4, 2018
Modelling of Cognitive Processes
TU Berlin
Reinforcement Learning
Hierarchical Reinforcement Learning
Demonstration
Resources
Appendix
Reinforcement Learning
The agent–environment loop: the agent takes action $a_t$; the environment responds with state $s_t$ and reward $r_t$.
Reinforcement Learning
MDP: $\langle S, A, p(s' \mid s, a), r(s, a) \rangle$
Policy: $\pi(a \mid s) : S \times A \to [0, 1]$
Goal: an optimal policy $\pi^*$ that maximizes $\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$
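To make the objective concrete, here is a minimal sketch (not from the slides) that Monte Carlo estimates the discounted return $\mathbb{E}[\sum_t \gamma^t r_t \mid s_0 = s]$ of a fixed stochastic policy on a hypothetical two-state MDP; the states, actions, and transition tables are purely illustrative.

```python
import random

# Hypothetical 2-state MDP (illustrative): p(s'|s,a) and r(s,a) as lookup tables.
P = {("s0", "stay"): [("s0", 1.0)],
     ("s0", "go"):   [("s1", 0.9), ("s0", 0.1)],
     ("s1", "stay"): [("s1", 1.0)],
     ("s1", "go"):   [("s0", 1.0)]}
R = {("s0", "stay"): 0.0, ("s0", "go"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "go"): 0.0}
pi = {"s0": {"stay": 0.2, "go": 0.8},   # pi(a|s): a distribution over actions per state
      "s1": {"stay": 0.9, "go": 0.1}}

def rollout_return(s, gamma=0.95, horizon=200):
    """One sampled discounted return sum_t gamma^t r_t starting from s_0 = s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = random.choices(list(pi[s]), weights=list(pi[s].values()))[0]
        g += discount * R[(s, a)]
        next_states, probs = zip(*P[(s, a)])
        s = random.choices(next_states, weights=probs)[0]
        discount *= gamma
    return g

# Monte Carlo estimate of V_pi(s0) = E[sum_t gamma^t r_t | s_0 = s0]
print(sum(rollout_return("s0") for _ in range(2000)) / 2000)
```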
Problems
• lack of planning and commitment
• inefficient exploration
• temporal credit assignment problem
• inability to divide-and-conquer
Hierarchical Reinforcement Learning
Temporal Abstractions
Icons made by Smashicons and Freepik from Freepik
Options Framework (Sutton, Precup, et al. 1999)
SMDP: $\langle S, A, p(s', k \mid s, a), r(s, a) \rangle$
Option $\omega$:
$I_\omega$: initiation set
$\pi_\omega$: intra-option-policy
$\beta_\omega$: termination-policy
$\pi_\Omega$: option-policy
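The three components of an option map naturally onto a small data structure. Below is a minimal sketch (not from the paper; the grid world, the "walk to the doorway" behaviour, and all names are illustrative) of an option with its initiation set $I_\omega$, intra-option-policy $\pi_\omega$, and termination-policy $\beta_\omega$.

```python
from dataclasses import dataclass
from typing import Callable, Set, Tuple

State = Tuple[int, int]   # (x, y) cell in a hypothetical grid world

@dataclass
class Option:
    initiation_set: Set[State]                 # I_w: states where the option may start
    policy: Callable[[State], str]             # pi_w: here a deterministic action choice
    termination: Callable[[State], float]      # beta_w(s): probability of terminating in s

# A toy option that walks right until it reaches the doorway column x == 5.
to_doorway = Option(
    initiation_set={(x, y) for x in range(5) for y in range(5)},
    policy=lambda s: "right",
    termination=lambda s: 1.0 if s[0] == 5 else 0.0,
)

print((2, 3) in to_doorway.initiation_set)   # True: option can be initiated here
print(to_doorway.policy((2, 3)))             # 'right'
print(to_doorway.termination((5, 3)))        # 1.0: terminates at the doorway
```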
Option-Critic (Bacon et al. 2017)
Given the number of options, Option-Critic learns $\beta_\omega$, $\pi_\omega$ and $\pi_\Omega$ in an end-to-end, online fashion.
• online, end-to-end learning of options in continuous state/action spaces
• allows using non-linear function approximators (deep RL)
Value Functions
The state value function:
$V_\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right] = \sum_{a \in A_s} \pi(s, a)\left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V(s') \right]$
The action value function:
$Q_\pi(s, a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right] = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a' \in A} \pi(s', a')\, Q(s', a')$
Bellman Equations (Bellman 1952)
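The Bellman equation turns directly into iterative policy evaluation. A minimal sketch for a hypothetical tabular MDP (the transition and reward tables are illustrative, not from the slides): it repeatedly applies $V(s) \leftarrow \sum_a \pi(s,a)[\,r(s,a) + \gamma \sum_{s'} p(s'|s,a) V(s')\,]$ until the values stop changing.

```python
# Iterative policy evaluation: apply the Bellman equation as an update until convergence.
# Toy 2-state MDP (illustrative): P[(s, a)] = [(s', prob), ...], R[(s, a)] = reward.
P = {("s0", "stay"): [("s0", 1.0)], ("s0", "go"): [("s1", 1.0)],
     ("s1", "stay"): [("s1", 1.0)], ("s1", "go"): [("s0", 1.0)]}
R = {("s0", "stay"): 0.0, ("s0", "go"): 0.0, ("s1", "stay"): 1.0, ("s1", "go"): 0.0}
pi = {"s0": {"stay": 0.5, "go": 0.5}, "s1": {"stay": 1.0, "go": 0.0}}
gamma = 0.9

V = {"s0": 0.0, "s1": 0.0}
for _ in range(1000):
    delta = 0.0
    for s in V:
        # V(s) = sum_a pi(s,a) [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]
        v_new = sum(p_a * (R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)]))
                    for a, p_a in pi[s].items())
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < 1e-8:
        break
print(V)   # V(s1) -> 10, V(s0) -> about 8.18 for this toy MDP
```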
Value methods
TD Learning:
$Q_\pi(s, a) \leftarrow Q_\pi(s, a) + \alpha \big[\, r + \gamma V_\pi(s') - Q_\pi(s, a) \,\big]$
where $r + \gamma V_\pi(s')$ is the TD target and the bracketed term is the TD error.
$V_\pi(s') = \max_a Q(s', a)$ (Q-Learning)
$\pi(a \mid s) = \operatorname{argmax}_a Q(s, a)$ (greedy policy)
$\pi(a \mid s) = \begin{cases} \text{random action} & \text{with probability } \varepsilon \\ \operatorname{argmax}_a Q(s, a) & \text{with probability } 1 - \varepsilon \end{cases}$ ($\varepsilon$-greedy policy)
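A minimal tabular Q-learning sketch with an $\varepsilon$-greedy behaviour policy, matching the updates above; the corridor environment and all constants are assumptions made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical 1-D corridor: states 0..5, reward 1 on reaching state 5 (episode ends).
ACTIONS = ["left", "right"]

def step(s, a):
    s2 = max(0, s - 1) if a == "left" else min(5, s + 1)
    return s2, (1.0 if s2 == 5 else 0.0), s2 == 5   # next state, reward, done

Q = defaultdict(float)
alpha, gamma, eps = 0.1, 0.95, 0.1

def eps_greedy(s):
    if random.random() < eps:                       # explore with probability eps
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])    # otherwise act greedily

for _ in range(500):
    s, done = 0, False
    while not done:
        a = eps_greedy(s)
        s2, r, done = step(s, a)
        # TD target uses V(s') = max_a' Q(s', a')  (Q-learning)
        target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # learning rate times TD error
        s = s2

print(max(Q[(0, a)] for a in ACTIONS))   # should approach gamma**4 ~ 0.81
```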
Option Value Functions
The option value function:
$Q_\Omega(s, \omega) = \sum_a \pi_\omega(a \mid s)\, Q_U(s, \omega, a)$
The action value function:
$Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, U(\omega, s')$
The state value function upon arrival:
$U(\omega, s') = (1 - \beta_\omega(s'))\, Q_\Omega(s', \omega) + \beta_\omega(s')\, V_\Omega(s')$
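Under the assumption of tabular values (all names and numbers below are illustrative, not from the paper), two of these equations translate almost one-to-one into code:

```python
# Sketch of the option value equations as plain functions (tabular, illustrative names).

def q_omega(s, w, pi_w, q_u):
    """Q_Omega(s, w) = sum_a pi_w(a|s) * Q_U(s, w, a)."""
    return sum(prob * q_u[(s, w, a)] for a, prob in pi_w[s].items())

def u_value(w, s_next, beta_w, q_omega_tab, v_omega):
    """U(w, s') = (1 - beta_w(s')) * Q_Omega(s', w) + beta_w(s') * V_Omega(s')."""
    b = beta_w[s_next]
    return (1.0 - b) * q_omega_tab[(s_next, w)] + b * v_omega[s_next]

# Toy numbers (purely illustrative) for one option w0 with two primitive actions.
pi_w = {"s0": {"a0": 0.3, "a1": 0.7}}
q_u = {("s0", "w0", "a0"): 1.0, ("s0", "w0", "a1"): 2.0}
print(q_omega("s0", "w0", pi_w, q_u))            # 0.3*1.0 + 0.7*2.0 = 1.7

beta_w = {"s1": 0.25}
q_omega_tab = {("s1", "w0"): 1.5}
v_omega = {"s1": 2.0}                            # e.g. max_w Q_Omega(s1, w)
print(u_value("w0", "s1", beta_w, q_omega_tab, v_omega))  # 0.75*1.5 + 0.25*2.0 = 1.625
```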
Policy Gradient Methods
$\pi(a \mid s) = \operatorname{argmax}_a Q(s, a)$ vs. $\pi(a \mid s, \theta) = \operatorname{softmax}_a h(s, a, \theta)$
Objective: $J(\theta) = V_{\pi_\theta}(s)$
How do we compute $\nabla_\theta J(\theta)$?
Policy Gradient Theorem (Sutton, McAllester, et al. 2000):
$\nabla_\theta J(\theta) = \sum_s \mu_\pi(s) \sum_a Q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta)$
$\nabla_\theta J(\theta) = \mathbb{E}\Big[\, \gamma^t \sum_a Q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta) \,\Big]$
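A minimal REINFORCE-style sketch of the policy gradient theorem with a softmax policy over tabular preferences $h(s, a, \theta) = \theta[s, a]$. The one-step environment is illustrative, and the sampled return stands in for $Q_\pi(s, a)$ as a standard unbiased estimate; this is a sketch, not the slides' implementation.

```python
import math, random

# Softmax policy pi(a|s, theta) = softmax_a h(s, a, theta) with tabular preferences theta[s, a].
ACTIONS = [0, 1]
theta = {("s0", a): 0.0 for a in ACTIONS}

def policy(s):
    prefs = [theta[(s, a)] for a in ACTIONS]
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reward(s, a):
    """Illustrative one-step environment: action 1 pays 1.0, action 0 pays 0.2."""
    return 1.0 if a == 1 else 0.2

alpha = 0.1
for _ in range(2000):
    s = "s0"
    probs = policy(s)
    a = random.choices(ACTIONS, weights=probs)[0]
    G = reward(s, a)                        # sampled return, stands in for Q_pi(s, a)
    # grad log pi(a|s) wrt theta[s, b] = 1{b == a} - pi(b|s)  for a softmax policy
    for b in ACTIONS:
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[(s, b)] += alpha * G * grad_log   # theta <- theta + alpha * G * grad log pi

print(policy("s0"))   # the probability of action 1 should come to dominate
```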
Actor-Critic (Sutton 1984)
$\theta \leftarrow \theta + \alpha \gamma^t\, \delta\, \nabla_\theta \log \pi(a \mid s, \theta)$, where $\delta$ is the TD error.
Figure taken from Pierre-Luc Bacon.
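Actor-critic combines a TD(0) critic with the policy-gradient actor: the critic's TD error $\delta$ replaces the sampled return in the actor update. A minimal sketch under the same kind of illustrative softmax parameterization (the two-state chain and all constants are assumptions):

```python
import math, random

# One-step actor-critic sketch on an illustrative 2-state chain (all names are assumptions).
STATES, ACTIONS = ["s0", "s1"], [0, 1]
theta = {(s, a): 0.0 for s in STATES for a in ACTIONS}   # actor parameters (softmax prefs)
V = {s: 0.0 for s in STATES}                             # critic
alpha_th, alpha_v, gamma = 0.05, 0.1, 0.9

def pi(s):
    m = max(theta[(s, a)] for a in ACTIONS)
    exps = [math.exp(theta[(s, a)] - m) for a in ACTIONS]
    z = sum(exps)
    return [e / z for e in exps]

def step(s, a):
    """Action 1 moves s0 -> s1; reaching s1 pays 1 and ends the episode."""
    if s == "s0" and a == 1:
        return "s1", 1.0, True
    return "s0", 0.0, False

for _ in range(2000):
    s, done, discount = "s0", False, 1.0
    while not done:
        probs = pi(s)
        a = random.choices(ACTIONS, weights=probs)[0]
        s2, r, done = step(s, a)
        delta = r + (0.0 if done else gamma * V[s2]) - V[s]   # TD error
        V[s] += alpha_v * delta                               # critic: TD(0) update
        for b in ACTIONS:                                     # actor: gamma^t * delta * grad log pi
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            theta[(s, b)] += alpha_th * discount * delta * grad_log
        s, discount = s2, discount * gamma

print(pi("s0"))   # action 1 should come to dominate
```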
Option-Critic (Bacon et al. 2017)
Architecture figure taken from Bacon et al. (2017).
Option-Critic (Bacon et al. 2017)
The gradient w.r.t. the intra-option-policy parameters $\theta$:
$\nabla_\theta Q_\Omega(s, \omega) = \mathbb{E}\left[ \frac{\partial \log \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a) \right]$
→ take better primitives inside options.
The gradient w.r.t. the termination-policy parameters $\vartheta$:
$\nabla_\vartheta U(\omega, s') = \mathbb{E}\left[ -\frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta} \big( Q_\Omega(s', \omega) - V_\Omega(s') \big) \right]$, with $V_\Omega(s')$ e.g. $\max_{\omega} Q_\Omega(s', \omega)$
→ shorten options with bad advantage.
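To see how the two gradients act, here is a hedged single-transition sketch assuming softmax intra-option-policies and sigmoid termination functions with tabular parameters; the critic estimates $Q_U$ and $Q_\Omega$ and every name and number are illustrative assumptions, not values from the paper.

```python
import math

# Illustrative single transition: in state s the agent, following option w0, took action a = 1
# and arrived in s_next. Critic estimates Q_U and Q_Omega are assumed given.
ACTIONS, OPTIONS = [0, 1], ["w0", "w1"]
theta = {("s", w, a): 0.0 for w in OPTIONS for a in ACTIONS}   # intra-option-policy params
vartheta = {("s_next", w): 0.0 for w in OPTIONS}               # termination params
alpha_th, alpha_vt = 0.1, 0.1

def pi_w(s, w):
    m = max(theta[(s, w, a)] for a in ACTIONS)
    exps = [math.exp(theta[(s, w, a)] - m) for a in ACTIONS]
    z = sum(exps)
    return [e / z for e in exps]

def beta_w(s, w):
    return 1.0 / (1.0 + math.exp(-vartheta[(s, w)]))           # sigmoid termination

# Assumed critic estimates for this transition (illustrative numbers).
Q_U = {("s", "w0", 0): 0.2, ("s", "w0", 1): 0.8}
Q_Omega = {("s_next", "w0"): 0.5, ("s_next", "w1"): 0.9}

s, w, a, s_next = "s", "w0", 1, "s_next"

# 1) Intra-option policy gradient: theta += alpha * dlog pi_w(a|s)/dtheta * Q_U(s, w, a)
probs = pi_w(s, w)
for b in ACTIONS:
    grad_log = (1.0 if b == a else 0.0) - probs[b]
    theta[(s, w, b)] += alpha_th * grad_log * Q_U[(s, w, a)]

# 2) Termination gradient: vartheta -= alpha * dbeta/dvartheta * (Q_Omega(s', w) - V_Omega(s'))
V_Omega = max(Q_Omega[(s_next, wi)] for wi in OPTIONS)          # e.g. max_w Q_Omega(s', w)
b_term = beta_w(s_next, w)
advantage = Q_Omega[(s_next, w)] - V_Omega                      # negative here: option w0 is worse
vartheta[(s_next, w)] -= alpha_vt * b_term * (1.0 - b_term) * advantage  # sigmoid derivative b(1-b)

print(pi_w("s", "w0"), beta_w("s_next", "w0"))   # action 1 favoured; termination probability rises
```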
Demonstration
Complex Environment i
(Bacon et al. 2017)
Complex Environment ii
(Harb et al. 2017)
But... i
But... ii
(Dilokthanakul et al. 2017)
Resources
• Sutton & Barto, Reinforcement Learning: An Introduction,
Second Edition Draft
• David Silver’s Reinforcement Learning Course
References
Bacon, Pierre-Luc, Jean Harb, and Doina Precup (2017). “The Option-Critic architecture”. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4–9, 2017, San Francisco, California, USA, pp. 1726–1734.
Bellman, Richard (1952). “On the theory of dynamic programming”. In: Proceedings of the National Academy of Sciences 38.8, pp. 716–719. doi: 10.1073/pnas.38.8.716.
Dilokthanakul, N., C. Kaplanis, N. Pawlowski, and M. Shanahan (2017). “Feature Control as Intrinsic Motivation for Hierarchical Reinforcement Learning”. In: ArXiv e-prints. arXiv: 1705.06769 [cs.LG].
Harb, Jean, Pierre-Luc Bacon, Martin Klissarov, and Doina Precup (2017). “When waiting is not an option: Learning options with a deliberation cost”. arXiv: 1709.04571.
Sutton, Richard S (1984). “Temporal credit assignment in reinforcement learning”. PhD thesis. AAI8410337.
Sutton, Richard S, David A McAllester, Satinder P Singh, and Yishay Mansour (2000). “Policy gradient methods for reinforcement learning with function approximation”. In: Advances in Neural Information Processing Systems, pp. 1057–1063.
Sutton, Richard S, Doina Precup, and Satinder Singh (1999). “Between MDPs and Semi-MDPs: A framework for temporal abstraction in reinforcement learning”. In: Artificial Intelligence 112.1–2, pp. 181–211. doi: 10.1016/S0004-3702(99)00052-1.
Appendix
Option-Critic (Bacon et al. 2017): tabular algorithm
procedure train(α, NΩ)
  s ← s0
  choose ω ∼ πΩ(ω|s)                        ▷ option-policy
  repeat
    choose a ∼ πω,θ(a|s)                    ▷ intra-option-policy
    take action a in s, observe s′ and r

    1. Options evaluation
    g ← r                                   ▷ TD target
    if s′ is not terminal then
      g ← g + γ (1 − βω,ϑ(s′)) QΩ(s′, ω) + γ βω,ϑ(s′) maxω̄ QΩ(s′, ω̄)

    2. Critic improvement
    δU ← g − QU(s, ω, a)
    QU(s, ω, a) ← QU(s, ω, a) + αU δU

    3. Intra-option Q-learning
    δΩ ← g − QΩ(s, ω)
    QΩ(s, ω) ← QΩ(s, ω) + αΩ δΩ

    4. Options improvement
    θ ← θ + αθ (∂ log πω,θ(a|s) / ∂θ) QU(s, ω, a)
    ϑ ← ϑ − αϑ (∂βω,ϑ(s′) / ∂ϑ) (QΩ(s′, ω) − maxω̄ QΩ(s′, ω̄) + ξ)

    if terminate ∼ βω,ϑ(s′) then            ▷ termination-policy
      choose ω ∼ πΩ(ω|s′)
    s ← s′
  until s is terminal
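For reference, a runnable Python sketch of the tabular algorithm above on a hypothetical 1-D corridor task. It follows the same four steps (options evaluation, critic improvement, intra-option Q-learning, options improvement), but the environment, the softmax/sigmoid parameterizations and every constant are assumptions made for illustration, not the authors' implementation.

```python
import math, random
from collections import defaultdict

# Hypothetical corridor: states 0..6, start at 0, goal at 6 (reward 1, episode ends).
ACTIONS, OPTIONS = ["left", "right"], [0, 1]
GOAL = 6

def env_step(s, a):
    s2 = max(0, s - 1) if a == "left" else min(GOAL, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

# Softmax intra-option policies (theta), sigmoid terminations (vartheta), tabular critics.
theta = defaultdict(float)      # theta[(s, w, a)]
vartheta = defaultdict(float)   # vartheta[(s, w)]
Q_U = defaultdict(float)        # Q_U(s, w, a)
Q_Omega = defaultdict(float)    # Q_Omega(s, w)
alpha_U = alpha_Om = 0.2
alpha_th = alpha_vt = 0.05
gamma, eps, xi = 0.95, 0.1, 0.01

def pi_w(s, w):
    m = max(theta[(s, w, a)] for a in ACTIONS)
    exps = {a: math.exp(theta[(s, w, a)] - m) for a in ACTIONS}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def beta_w(s, w):
    return 1.0 / (1.0 + math.exp(-vartheta[(s, w)]))

def pick_option(s):
    """epsilon-greedy option-policy pi_Omega over Q_Omega."""
    if random.random() < eps:
        return random.choice(OPTIONS)
    return max(OPTIONS, key=lambda w: Q_Omega[(s, w)])

for _ in range(300):
    s, done = 0, False
    w = pick_option(s)
    while not done:
        probs = pi_w(s, w)
        a = random.choices(list(probs), weights=list(probs.values()))[0]
        s2, r, done = env_step(s, a)

        # 1. Options evaluation: TD target g
        g = r
        if not done:
            b = beta_w(s2, w)
            g += gamma * ((1 - b) * Q_Omega[(s2, w)]
                          + b * max(Q_Omega[(s2, wi)] for wi in OPTIONS))

        # 2. Critic improvement
        Q_U[(s, w, a)] += alpha_U * (g - Q_U[(s, w, a)])

        # 3. Intra-option Q-learning
        Q_Omega[(s, w)] += alpha_Om * (g - Q_Omega[(s, w)])

        # 4. Options improvement: intra-option policy gradient and termination gradient
        for b_a in ACTIONS:
            grad_log = (1.0 if b_a == a else 0.0) - probs[b_a]
            theta[(s, w, b_a)] += alpha_th * grad_log * Q_U[(s, w, a)]
        adv = Q_Omega[(s2, w)] - max(Q_Omega[(s2, wi)] for wi in OPTIONS) + xi
        db = beta_w(s2, w) * (1 - beta_w(s2, w))            # derivative of the sigmoid
        vartheta[(s2, w)] -= alpha_vt * db * adv

        # Terminate the option stochastically and re-select via pi_Omega
        if not done and random.random() < beta_w(s2, w):
            w = pick_option(s2)
        s = s2

print({w: pi_w(0, w) for w in OPTIONS})   # intra-option policies at the start state
```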
