11. Reinforcement Learning
MDP: $\langle S, A, p(s'|s,a), r(s,a) \rangle$
Policy: $\pi(a|s) : S \times A \to [0, 1]$
Goal: an optimal policy $\pi^*$ that maximizes $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$
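To make the objective concrete, here is a minimal Python sketch of the discounted return being maximized; the reward sequence is an illustrative placeholder, not from any particular task:

```python
# Discounted return G = sum_t gamma^t * r_t for a finite reward sequence.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):  # fold backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81: a reward two steps away
```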
29. Option-Critic (Bacon et al. 2017)
Given the number of options, Option-Critic learns $\beta_\omega$, $\pi_\omega$, and $\pi_\Omega$ in an end-to-end, online fashion, and admits non-linear function approximators, enabling continuous state and action spaces.
• online, end-to-end learning of options in continuous state/action spaces
• allows using non-linear function approximators (deep RL)
31. Value Functions
The state value function:
$$V_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right] = \sum_{a \in A_s} \pi(a|s) \left[ r(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V_\pi(s') \right]$$
The action value function:
$$Q_\pi(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\, a_0 = a\right] = r(s,a) + \gamma \sum_{s'} p(s'|s,a) \sum_{a' \in A_{s'}} \pi(a'|s')\, Q_\pi(s',a')$$
Bellman Equations (Bellman 1952)
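The Bellman equation for $V_\pi$ turns directly into an update rule. A minimal sketch, assuming a small tabular MDP stored as NumPy arrays (the names `P`, `R`, `pi` are placeholders, not from the slides):

```python
import numpy as np

# Iterative policy evaluation: sweep the Bellman equation until V converges.
# P[s, a, s2] = p(s2|s, a), R[s, a] = r(s, a), pi[s, a] = pi(a|s).
def policy_evaluation(P, R, pi, gamma=0.99, tol=1e-8):
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V         # Q(s,a) = r(s,a) + gamma * E[V(s')]
        V_new = (pi * Q).sum(axis=1)  # V(s) = sum_a pi(a|s) Q(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```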
35. Value Methods
TD Learning:
$$Q_\pi(s,a) \leftarrow Q_\pi(s,a) + \alpha \underbrace{\Big( \overbrace{r + \gamma V_\pi(s')}^{\text{TD target}} - Q_\pi(s,a) \Big)}_{\text{TD error}}$$
$$V_\pi(s') = \max_{a'} Q(s',a') \qquad \text{(Q-Learning)}$$
$$\pi(a|s) = \operatorname*{arg\,max}_a Q(s,a) \qquad \text{(greedy policy)}$$
$$\pi(a|s) = \begin{cases} \text{random action} & \text{with probability } \varepsilon \\ \operatorname*{arg\,max}_a Q(s,a) & \text{with probability } 1 - \varepsilon \end{cases} \qquad (\varepsilon\text{-greedy policy})$$
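A minimal tabular Q-learning sketch with an $\varepsilon$-greedy policy (Python; the `env` object with `reset`/`step` methods is an assumed Gym-style interface, not part of the slides):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability eps, else act greedily
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done = env.step(a)
            # TD target r + gamma * max_a' Q(s', a'); zero bootstrap at terminal
            target = r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])  # move Q toward the TD target
            s = s2
    return Q
```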
39. Option Value Functions
The option value function:
$$Q_\Omega(s,\omega) = \sum_a \pi_\omega(a|s)\, Q_U(s,\omega,a)$$
The action value function:
$$Q_U(s,\omega,a) = r(s,a) + \gamma \sum_{s'} p(s'|s,a)\, U(\omega,s')$$
The state value function upon arrival:
$$U(\omega,s') = \big(1 - \beta_\omega(s')\big)\, Q_\Omega(s',\omega) + \beta_\omega(s')\, V_\Omega(s')$$
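A minimal sketch of the arrival value $U$ in code, assuming tabular arrays `Q_omega[s, omega]` and `beta[s, omega]` (placeholder names) and a greedy policy over options, i.e. $V_\Omega(s') = \max_\omega Q_\Omega(s',\omega)$:

```python
import numpy as np

# U(omega, s2) mixes two outcomes: with prob. 1 - beta the current option
# continues (value Q_Omega(s2, omega)); with prob. beta it terminates and
# a new option is chosen (value V_Omega(s2), here the greedy max).
def arrival_value(Q_omega, beta, omega, s2):
    v_omega = np.max(Q_omega[s2])  # greedy value over options
    b = beta[s2, omega]
    return (1.0 - b) * Q_omega[s2, omega] + b * v_omega
```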
46. Policy Gradient Methods
$$\pi(a|s) = \operatorname*{arg\,max}_a Q(s,a) \quad \text{vs.} \quad \pi(a|s,\theta) = \operatorname*{softmax}_a h(s,a,\theta)$$
$$J(\theta) = V_{\pi_\theta}(s_0), \qquad \nabla_\theta J(\theta) = \,?$$
Policy Gradient Theorem (Sutton, McAllester, et al. 2000):
$$\nabla_\theta J(\theta) = \sum_s \mu_\pi(s) \sum_a Q_\pi(s,a)\, \nabla_\theta \pi(a|s,\theta)$$
$$\nabla_\theta J(\theta) = \mathbb{E}\left[ \gamma^t \sum_a Q_\pi(s,a)\, \nabla_\theta \pi(a|s,\theta) \right]$$
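A minimal sketch of a softmax policy and one score-function gradient step (Python/NumPy; the tabular preference array `h[s, a]` and the sampled value `q_sa` are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())  # shift by max for numerical stability
    return z / z.sum()

# For a tabular softmax policy, grad_theta log pi(a|s) = onehot(a) - pi(.|s),
# so one ascent step on J scales that score by Q(s, a).
def pg_step(h, s, a, q_sa, lr=0.1):
    pi = softmax(h[s])
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    h[s] += lr * q_sa * grad_log_pi
    return h
```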
51. Option-Critic (Bacon et al. 2017)
The gradient w.r.t. the intra-option-policy parameters θ:
$$\nabla_\theta Q_\Omega(s,\omega) = \mathbb{E}\left[ \frac{\partial \log \pi_{\omega,\theta}(a|s)}{\partial \theta}\, Q_U(s,\omega,a) \right]$$
→ take better primitive actions inside options.
The gradient w.r.t. the termination-policy parameters ϑ:
$$\nabla_\vartheta U(\omega,s') = \mathbb{E}\left[ -\frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta} \big( Q_\Omega(s',\omega) - V_\Omega(s') \big) \right], \quad \text{e.g. } V_\Omega(s') = \max_\omega Q_\Omega(s',\omega)$$
→ shorten options with a bad (negative) advantage.
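A minimal sketch of both updates for a tabular parameterization, with softmax intra-option policies and sigmoid terminations (Python; the array layouts, learning rates, and the greedy $V_\Omega$ are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# theta[omega, s, a]: intra-option action preferences (softmax policy).
# vartheta[omega, s]: termination logits, beta_omega(s) = sigmoid(...).
def option_critic_step(theta, vartheta, Q_U, Q_omega, s, omega, a, s2,
                       lr_theta=0.1, lr_beta=0.1):
    # Intra-option policy gradient: ascend grad log pi * Q_U.
    pi = np.exp(theta[omega, s] - theta[omega, s].max())
    pi /= pi.sum()
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta[omega, s] += lr_theta * grad_log_pi * Q_U[s, omega, a]

    # Termination gradient: ascend U, i.e. step against beta times the
    # advantage, so options with a negative advantage terminate sooner.
    beta = sigmoid(vartheta[omega, s2])
    advantage = Q_omega[s2, omega] - np.max(Q_omega[s2])
    vartheta[omega, s2] -= lr_beta * beta * (1.0 - beta) * advantage
    return theta, vartheta
```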
65. • Sutton & Barto, Reinforcement Learning: An Introduction,
Second Edition Draft
• David Silver’s Reinforcement Learning Course
66. References i
Bacon, Pierre-Luc, Jean Harb, and Doina Precup (2017). “The Option-Critic Architecture”. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pp. 1726–1734.
Bellman, Richard (1952). “On the theory of dynamic programming”. In: Proceedings of the National Academy of Sciences 38.8, pp. 716–719. doi: 10.1073/pnas.38.8.716.
Dilokthanakul, N., C. Kaplanis, N. Pawlowski, and M. Shanahan (2017). “Feature Control as Intrinsic Motivation for Hierarchical Reinforcement Learning”. In: ArXiv e-prints. arXiv: 1705.06769 [cs.LG].
67. References ii
Harb, Jean, Pierre-Luc Bacon, Martin Klissarov, and Doina Precup (2017). “When waiting is not an option: Learning options with a deliberation cost”. In: arXiv: 1709.04571.
Sutton, Richard S (1984). “Temporal credit assignment in reinforcement learning”. PhD thesis. AAI8410337.
Sutton, Richard S, David A McAllester, Satinder P Singh, and Yishay Mansour (2000). “Policy gradient methods for reinforcement learning with function approximation”. In: Advances in Neural Information Processing Systems, pp. 1057–1063.
68. References iii
Sutton, Richard S, Doina Precup, and Satinder Singh (1999). “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning”. In: Artificial Intelligence 112.1-2, pp. 181–211. doi: 10.1016/S0004-3702(99)00052-1.
70. Option-Critic (Bacon et al. 2017) i
procedure train(α, NΩ)
    s ← s0
    choose ω ∼ πΩ(ω|s)                      ▷ policy over options
    repeat
        choose a ∼ πω,θ(a|s)                ▷ intra-option policy
        take action a in s, observe s′ and r
        1. Option evaluation:
        g ← r                               ▷ TD target
        if s′ is not terminal then
            g ← g + γ (1 − βω,ϑ(s′)) QΩ(s′, ω) + γ βω,ϑ(s′) maxω′ QΩ(s′, ω′)