1
Maximum Entropy Reinforcement Learning
Stochastic Control
T. Haarnoja, et al., “Reinforcement Learning with Deep Energy-Based Policies”, ICML 2017
T. Haarnoja, et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”, ICML 2018
T. Haarnoja, et al., “Soft Actor-Critic Algorithms and Applications”, arXiv preprint 2018
Presented by Dongmin Lee
August 2019
2
Outline
1. Introduction
2. Reinforcement Learning
• Markov Decision Processes (MDPs)
• Bellman Equation
3. Exploration methods
4. Maximum Entropy Reinforcement Learning
• Soft MDPs
• Soft Bellman Equation
5. From Soft Policy Iteration to Soft Actor-Critic
• Soft Policy Iteration
• Soft Actor-Critic
6. Results
3
Introduction
How do we build intelligent systems / machines?
• Data → Information (intelligence)
• Information → Decision (intelligence)
Intelligent systems must be able to adapt to uncertainty
• Uncertainty:
• Other cars
• Weather
• Road construction
• Passenger
4
Introduction
RL provides a formalism for behaviors
• Problem of a goal-directed agent interacting with an uncertain environment
• Interaction → adaptation
5
feedback & decision
Introduction
6
Introduction
Distinct Features of RL
1. There is no supervisor, only a reward signal
2. Feedback is delayed, not instantaneous
3. Time really matters (sequential, non i.i.d data)
4. Tradeoff between exploration and exploitation
What is deep RL?
Deep RL = RL + Deep learning
• RL: sequential decision making interacting with environments
• Deep learning: representation of policies and value functions
7
source: http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_2.pdf
Introduction
What are the challenges of deep RL?
• Huge # of samples: millions
• Fast, stable learning
• Hyperparameter tuning
• Exploration
• Sparse reward signals
• Safety / reliability
• Simulator
8
Introduction
• Soft Actor-Critic on Minitaur – Training
9
Introduction
source : https://sites.google.com/view/sac-and-applications
• Soft Actor-Critic on Minitaur – Testing
10
Introduction
source : https://sites.google.com/view/sac-and-applications
• Interfered rollouts from Soft Actor-Critic policy trained for Dynamixel Claw task from vision
11
Introduction
source : https://sites.google.com/view/sac-and-applications
• State of the environment / system
• Action determined by the agent: affects the state
• Reward: evaluates the benefit of state and action
12
Reinforcement Learning
• State: $s_t \in S$
• Action: $a_t \in A$
• Policy (mapping from state to action): $\pi$
  • Deterministic: $\pi(s_t) = a_t$
  • Stochastic (randomized): $\pi(a \mid s) = \mathrm{Prob}(a_t = a \mid s_t = s)$
→ The policy is what we want to optimize!
13
Reinforcement Learning
A Markov decision process (MDP) is a tuple $\langle S, A, p, r, \gamma \rangle$, consisting of
• $S$: set of states (state space), e.g., $S = \{1, \dots, n\}$ (discrete) or $S = \mathbb{R}^n$ (continuous)
• $A$: set of actions (action space), e.g., $A = \{1, \dots, m\}$ (discrete) or $A = \mathbb{R}^m$ (continuous)
• $p$: state transition probability, $p(s' \mid s, a) \triangleq \mathrm{Prob}(s_{t+1} = s' \mid s_t = s, a_t = a)$
• $r$: reward function, $r(s_t, a_t) = r_t$
• $\gamma \in (0, 1]$: discount factor
14
Markov Decision Processes (MDPs)
The MDP problem
• To find an optimal policy that maximizes the expected cumulative reward:
$$\max_{\pi \in \Pi} \; \mathbb{E}_\pi\Big[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \Big]$$
Value function
• The state value function $V^\pi(s)$ of a policy $\pi$ is the expected return starting from state $s$ and then following $\pi$:
$$V^\pi(s) \triangleq \mathbb{E}_\pi\Big[ \sum_{k=t}^{\infty} \gamma^{k-t} \, r(s_k, a_k) \,\Big|\, s_t = s \Big]$$
Q-function
• The action value function $Q^\pi(s, a)$ of a policy $\pi$ is the expected return starting from state $s$, taking action $a$, and then following $\pi$:
$$Q^\pi(s, a) \triangleq \mathbb{E}_\pi\Big[ \sum_{k=t}^{\infty} \gamma^{k-t} \, r(s_k, a_k) \,\Big|\, s_t = s, a_t = a \Big]$$
15
Markov Decision Processes (MDPs)
The MDP problem
• To find an optimal policy that maximizes the expected cumulative reward:
$$\max_{\pi \in \Pi} \; \mathbb{E}_\pi\Big[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \Big]$$
Optimal value function
• The optimal value function $V^*(s)$ is the maximum value function over all policies:
$$V^*(s) \triangleq \max_{\pi \in \Pi} V^\pi(s) = \max_{\pi \in \Pi} \mathbb{E}_\pi\Big[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \,\Big|\, s_0 = s \Big]$$
Optimal Q-function
• The optimal Q-function $Q^*(s, a)$ is the maximum Q-function over all policies:
$$Q^*(s, a) \triangleq \max_{\pi \in \Pi} Q^\pi(s, a)$$
16
Markov Decision Processes (MDPs)
Bellman expectation equation for $V^\pi$
• The value function $V^\pi$ is the unique solution to the following Bellman equation:
$$V^\pi(s) = \mathbb{E}\big[ r(s_t, a_t) + \gamma V^\pi(s_{t+1}) \mid s_t = s \big]
= \sum_{a \in A} \pi(a \mid s) \Big( r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V^\pi(s') \Big)
= \sum_{a \in A} \pi(a \mid s) \, Q^\pi(s, a)$$
• In operator form: $V^\pi = T^\pi V^\pi$
Bellman expectation equation for $Q^\pi$
• The Q-function $Q^\pi$ satisfies:
$$Q^\pi(s, a) = \mathbb{E}\big[ r(s_t, a_t) + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a \big]
= r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V^\pi(s')
= r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \sum_{a' \in A} \pi(a' \mid s') \, Q^\pi(s', a')$$
17
Bellman Equation
Bellman optimality equation for $V^*$
• The optimal value function $V^*$ must satisfy the self-consistency condition given by the Bellman optimality equation:
$$V^*(s) = \max_{a \in A} \mathbb{E}\big[ r(s_t, a_t) + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \big]
= \max_{a \in A} \Big( r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V^*(s') \Big)
= \max_{a \in A} Q^*(s, a)$$
• In operator form: $V^* = T V^*$
Bellman optimality equation for $Q^*$
• The optimal Q-function $Q^*$ satisfies:
$$Q^*(s, a) = \mathbb{E}\big[ r(s_t, a_t) + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \big]
= r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V^*(s')
= r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \max_{a' \in A} Q^*(s', a')$$
18
Bellman Equation
The MDP problem
• To find an optimal policy that maximizes the expected cumulative reward:
$$\max_{\pi \in \Pi} \; \mathbb{E}_\pi\Big[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \Big]$$
Known model:
• Policy iteration
• Value iteration
Unknown model:
• Temporal-difference learning
• Q-learning
19
Bellman Equation
Policy Iteration
Initialize a policy $\pi_0$;
1. (Policy Evaluation) Compute the value $V^{\pi_k}$ of $\pi_k$ by solving the Bellman equation $V^{\pi_k} = T^{\pi_k} V^{\pi_k}$;
2. (Policy Improvement) Update the policy to $\pi_{k+1}$ so that
$$\pi_{k+1}(s) \in \arg\max_{a} \Big\{ r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V^{\pi_k}(s') \Big\}$$
3. Set $k \leftarrow k + 1$; repeat until convergence.
• The converged policy is optimal!
• Q) Can we do something even simpler?
20
Bellman Equation
Value Iteration
Initialize a value function $V_0$;
1. Update the value function by $V_{k+1} \leftarrow T V_k$;
2. Set $k \leftarrow k + 1$; repeat until convergence.
• The converged value function is optimal!
21
Bellman Equation
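A minimal value iteration sketch on a made-up 2-state, 2-action MDP (illustrative only, not from the slides):

```python
# Minimal value iteration sketch on a toy 2-state, 2-action MDP (illustrative only).
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# p[s, a, s'] = transition probability, r[s, a] = reward (toy numbers)
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: V_{k+1}(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V_k(s') ]
    Q = r + gamma * p @ V          # shape (n_states, n_actions)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # stop at (approximate) convergence
        break
    V = V_new

greedy_policy = Q.argmax(axis=1)   # greedy policy w.r.t. the converged values
print(V, greedy_policy)
```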
Common issue: exploration
• Is there a value of exploring unknown regions of the environment?
• Which action should we try to explore unknown regions of the environment?
22
Exploration methods
How can we algorithmically solve this issue in exploration vs exploitation?
1. 𝜖-Greedy
2. Optimism in the face of uncertainty
3. Thompson (posterior) sampling
4. Noisy networks
5. Information theoretic exploration
23
Exploration methods
Method 1: 𝜖-Greedy
Idea:
• Choose a greedy action ($\arg\max_a Q^\pi(s, a)$) with probability $1 - \epsilon$
• Choose a random action with probability 𝜖
Advantages:
1. Very simple
2. Light computation
Disadvantages:
1. Fine tuning 𝜖
2. Not quite systematic: No focus on unexplored regions
3. Inefficient
24
Exploration methods
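A minimal sketch of ε-greedy action selection over tabular Q-values (illustrative; the Q-values below are made up):

```python
# Epsilon-greedy action selection over tabular Q-values (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q_row, epsilon=0.1):
    """Pick argmax_a Q(s, a) with prob. 1 - epsilon, otherwise a uniformly random action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_row)))   # explore
    return int(np.argmax(Q_row))               # exploit

Q_s = np.array([0.2, 1.5, -0.3])               # hypothetical Q-values for one state
action = epsilon_greedy(Q_s, epsilon=0.1)
print(action)
```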
Method 2: Optimism in the face of uncertainty (OFU)
Idea:
1. Construct a confidence set for MDP parameters using collected samples
2. Choose MDP parameters that give the highest rewards
3. Construct an optimal policy of the optimistically chosen MDP
Advantages:
1. Regret optimal
2. Systematic: More focus on unexplored regions
Disadvantages:
1. Complicated
2. Computation intensive
25
Exploration methods
Method 3: Thompson (posterior) sampling
Idea:
1. Sample MDP parameters from posterior distribution
2. Construct an optimal policy of the sampled MDP parameters
3. Update the posterior distribution
Advantages:
1. Regret optimal
2. Systematic: More focus on unexplored regions
Disadvantages:
1. Somewhat complicated
26
Exploration methods
Method 4: Noisy networks
Idea:
• Inject noise to the weights of Neural Network
Advantages:
1. Simple
2. Good empirical performance
Disadvantages:
1. Not systematic: No focus on unexplored regions
27
Exploration methods
Method 5: Information theoretic exploration
Idea:
• Use high entropy policy to explore more
Advantages:
1. Simple
2. Good empirical performance
Disadvantages:
1. More theoretical analysis needed
28
Exploration methods
What is entropy?
• Average rate at which information is produced by a stochastic source of data:
$$H(p) \triangleq - \sum_{i} p_i \log p_i = \mathbb{E}\big[ -\log p_i \big]$$
• More generally, entropy refers to disorder
29
Maximum Entropy Reinforcement Learning
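A small numeric check of the definition (a sketch; the distributions are made up): a peaked distribution has low entropy, a uniform one has the maximum.

```python
# Entropy H(p) = -sum_i p_i log p_i for two made-up discrete distributions.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))   # natural log -> entropy in nats

print(entropy([0.98, 0.01, 0.01]))  # peaked ("ordered") distribution: low entropy (~0.11)
print(entropy([1/3, 1/3, 1/3]))     # uniform ("disordered") distribution: max entropy = log 3 (~1.10)
```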
What's the benefit of a high-entropy policy?
• Entropy of a policy:
$$H(\pi(\cdot \mid s)) \triangleq - \sum_{a} \pi(a \mid s) \log \pi(a \mid s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ -\log \pi(a \mid s) \big]$$
• Higher disorder in the policy $\pi$
• Tries new, risky behaviors: potentially explores unexplored regions
30
Maximum Entropy Reinforcement Learning
Maximum entropy stochastic control
Standard MDP problem:
$$\max_{\pi} \; \mathbb{E}_\pi\Big[ \sum_{t} \gamma^t \, r(s_t, a_t) \Big]$$
Maximum entropy MDP problem:
$$\max_{\pi} \; \mathbb{E}_\pi\Big[ \sum_{t} \gamma^t \big( r(s_t, a_t) + H(\pi_t(\cdot \mid s_t)) \big) \Big]$$
31
Soft MDPs
Theorem 1 (Soft value functions and optimal policy)
Soft Q-function:
$$Q_{\mathrm{soft}}^{\pi}(s_t, a_t) \triangleq r(s_t, a_t) + \mathbb{E}_\pi\Big[ \sum_{l=1}^{\infty} \gamma^l \big( r(s_{t+l}, a_{t+l}) + H(\pi(\cdot \mid s_{t+l})) \big) \Big]$$
Soft value function:
$$V_{\mathrm{soft}}^{\pi}(s_t) \triangleq \log \int \exp\big( Q_{\mathrm{soft}}^{\pi}(s_t, a) \big) \, da$$
Optimal value functions:
$$Q_{\mathrm{soft}}^{*}(s_t, a_t) \triangleq \max_{\pi} Q_{\mathrm{soft}}^{\pi}(s_t, a_t), \qquad
V_{\mathrm{soft}}^{*}(s_t) \triangleq \log \int \exp\big( Q_{\mathrm{soft}}^{*}(s_t, a) \big) \, da$$
32
Soft Bellman Equation
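For a discrete action set, the soft value is a log-sum-exp over the Q-values; a minimal, numerically stable sketch (the Q-values are made up):

```python
# Soft value as a (numerically stable) log-sum-exp over discrete actions (sketch).
import numpy as np

def soft_value(q_values):
    """V_soft(s) = log sum_a exp(Q_soft(s, a)), stabilized by subtracting max_a Q."""
    q = np.asarray(q_values, dtype=float)
    q_max = q.max()
    return float(q_max + np.log(np.sum(np.exp(q - q_max))))

q = np.array([4.3, 3.3])            # hypothetical soft Q-values at one state
print(soft_value(q))                # ~4.61: slightly above max Q (a "soft" maximum)
print(q.max())                      # 4.3: the hard maximum it relaxes
```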
Proof.
Maximum entropy MDP problem:
$$\max_{\pi} \; \mathbb{E}_\pi\Big[ \sum_{t} \gamma^t \big( r(s_t, a_t) + H(\pi_t(\cdot \mid s_t)) \big) \Big]$$
Soft value function:
$$V_{\mathrm{soft}}^{\pi}(s) = \mathbb{E}_\pi\Big[ \sum_{k=t}^{\infty} \gamma^{k-t} \big( r(s_k, a_k) - \log \pi(a_k \mid s_k) \big) \,\Big|\, s_t = s \Big]
= \mathbb{E}_\pi\big[ r(s_t, a_t) - \log \pi(a_t \mid s_t) + \gamma V_{\mathrm{soft}}^{\pi}(s_{t+1}) \mid s_t = s \big]$$
Soft Q-function:
$$Q_{\mathrm{soft}}^{\pi}(s, a) = r(s, a) + \mathbb{E}_\pi\Big[ \sum_{k=t+1}^{\infty} \gamma^{k-t} \big( r(s_k, a_k) - \log \pi(a_k \mid s_k) \big) \,\Big|\, s_t = s, a_t = a \Big]
= \mathbb{E}_\pi\big[ r(s_t, a_t) + \gamma V_{\mathrm{soft}}^{\pi}(s_{t+1}) \mid s_t = s, a_t = a \big]$$
33
Soft Bellman Equation
Note that $Q_{\mathrm{soft}}^{\pi}(s, a)$ keeps the same recursive form as the standard Q-function.
Proof.
Soft value function:
$$V_{\mathrm{soft}}^{\pi}(s) = \mathbb{E}_\pi\Big[ \sum_{k=t}^{\infty} \gamma^{k-t} \big( r(s_k, a_k) - \log \pi(a_k \mid s_k) \big) \,\Big|\, s_t = s \Big]
= \mathbb{E}_\pi\big[ r(s_t, a_t) - \log \pi(a_t \mid s_t) + \gamma V_{\mathrm{soft}}^{\pi}(s_{t+1}) \mid s_t = s \big]$$
Soft Q-function:
$$Q_{\mathrm{soft}}^{\pi}(s, a) = \mathbb{E}_\pi\big[ r(s_t, a_t) + \gamma V_{\mathrm{soft}}^{\pi}(s_{t+1}) \mid s_t = s, a_t = a \big]$$
Soft value function (rewritten):
$$V_{\mathrm{soft}}^{\pi}(s) = \mathbb{E}_\pi\big[ r(s_t, a_t) - \log \pi(a_t \mid s_t) + \gamma V_{\mathrm{soft}}^{\pi}(s_{t+1}) \mid s_t = s \big]
= \mathbb{E}_\pi\big[ Q_{\mathrm{soft}}^{\pi}(s_t, a_t) - \log \pi(a_t \mid s_t) \mid s_t = s \big]$$
34
Soft Bellman Equation
Proof.
Given a policy $\pi_{\mathrm{old}}$, define a new policy $\pi_{\mathrm{new}}$ as:
$$\pi_{\mathrm{new}}(a \mid s) \propto \exp\big( Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big)$$
$$\pi_{\mathrm{new}}(a \mid s)
= \frac{\exp\big( Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big)}{\int \exp\big( Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a') \big) \, da'}
= \operatorname{softmax}_{a \in A} Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a)
= \exp\big( Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) - V_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s) \big)
= \exp\big( A_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big)$$
Substitute $\pi_{\mathrm{new}}(a \mid s)$ into $V_{\mathrm{soft}}^{\pi}(s)$:
$$V_{\mathrm{soft}}^{\pi}(s)
= \mathbb{E}_{a \sim \pi}\Big[ Q_{\mathrm{soft}}^{\pi}(s, a) - \log \frac{\exp\big( Q_{\mathrm{soft}}^{\pi}(s, a) \big)}{\int \exp\big( Q_{\mathrm{soft}}^{\pi}(s, a') \big) \, da'} \Big]
= \mathbb{E}_{a \sim \pi}\Big[ Q_{\mathrm{soft}}^{\pi}(s, a) - \log \exp\big( Q_{\mathrm{soft}}^{\pi}(s, a) \big) + \log \int \exp\big( Q_{\mathrm{soft}}^{\pi}(s, a') \big) \, da' \Big]$$
35
Soft Bellman Equation
Proof.
Substitute $\pi_{\mathrm{new}}(a \mid s)$ into $V_{\mathrm{soft}}^{\pi}(s)$:
$$V_{\mathrm{soft}}^{\pi}(s)
= \mathbb{E}_{a \sim \pi}\Big[ Q_{\mathrm{soft}}^{\pi}(s, a) - \log \exp\big( Q_{\mathrm{soft}}^{\pi}(s, a) \big) + \log \int \exp\big( Q_{\mathrm{soft}}^{\pi}(s, a') \big) \, da' \Big]
= \mathbb{E}_{a \sim \pi}\Big[ \log \int \exp\big( Q_{\mathrm{soft}}^{\pi}(s, a') \big) \, da' \Big]
= \log \int \exp\big( Q_{\mathrm{soft}}^{\pi}(s, a') \big) \, da' \quad (\text{the remaining term is independent of } a)$$
$$\therefore \; V_{\mathrm{soft}}^{\pi}(s) = \log \int \exp\big( Q_{\mathrm{soft}}^{\pi}(s, a) \big) \, da$$
36
Soft Bellman Equation
Theorem 1 (Soft value functions and optimal policy)
Soft Q-function:
$$Q_{\mathrm{soft}}^{\pi}(s_t, a_t) \triangleq r(s_t, a_t) + \mathbb{E}_\pi\Big[ \sum_{l=1}^{\infty} \gamma^l \big( r(s_{t+l}, a_{t+l}) + H(\pi(\cdot \mid s_{t+l})) \big) \Big]$$
Soft value function:
$$V_{\mathrm{soft}}^{\pi}(s_t) \triangleq \log \int \exp\big( Q_{\mathrm{soft}}^{\pi}(s_t, a) \big) \, da$$
Optimal value functions:
$$Q_{\mathrm{soft}}^{*}(s_t, a_t) \triangleq \max_{\pi} Q_{\mathrm{soft}}^{\pi}(s_t, a_t), \qquad
V_{\mathrm{soft}}^{*}(s_t) \triangleq \log \int \exp\big( Q_{\mathrm{soft}}^{*}(s_t, a) \big) \, da$$
37
Soft Bellman Equation
Theorem 2 (Soft policy improvement theorem)
Given a policy $\pi_{\mathrm{old}}$, define a new policy $\pi_{\mathrm{new}}$ as:
$$\pi_{\mathrm{new}}(a \mid s) \propto \exp\big( Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big)$$
Assume that throughout our computation, $Q$ is bounded and $\int \exp\big( Q(s, a) \big) \, da$ is bounded for any $s$ (for both $\pi_{\mathrm{old}}$ and $\pi_{\mathrm{new}}$).
Then,
$$Q_{\mathrm{soft}}^{\pi_{\mathrm{new}}}(s, a) \ge Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \quad \forall s, a$$
→ Monotonic policy improvement
38
Soft Bellman Equation
39
Soft Bellman Equation
Proof.
If one greedily maximizes the sum of entropy and value with a one-step look-ahead, then one obtains $\pi_{\mathrm{new}}$ from $\pi_{\mathrm{old}}$:
$$H(\pi_{\mathrm{old}}(\cdot \mid s)) + \mathbb{E}_{\pi_{\mathrm{old}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big]
\le H(\pi_{\mathrm{new}}(\cdot \mid s)) + \mathbb{E}_{\pi_{\mathrm{new}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big]$$
Example (tree MDP figure; the edges are labeled with $\pi(a \mid s)$ and $r(s, a)$): under $\pi_{\mathrm{old}}$, both actions are taken with probability 0.5 in every state. From the root $s_0$, "left" gives reward +1 and leads to $s_1$, and "right" gives reward +1 and leads to $s_2$. From $s_1$, "left" gives +2 and "right" gives +4; from $s_2$, "left" gives +3 and "right" gives +1 (the leaves $s_3, \dots, s_6$ are terminal).
40
Soft Bellman Equation
Proof.
If one greedily maximizes the sum of entropy and value with a one-step look-ahead, then one obtains $\pi_{\mathrm{new}}$ from $\pi_{\mathrm{old}}$:
$$H(\pi_{\mathrm{old}}(\cdot \mid s)) + \mathbb{E}_{\pi_{\mathrm{old}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big]
\le H(\pi_{\mathrm{new}}(\cdot \mid s)) + \mathbb{E}_{\pi_{\mathrm{new}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big]$$
Example (same tree as above, $\pi_{\mathrm{old}}$): what are $Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, \mathrm{left})$ and $Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, \mathrm{right})$ at $s_0$, $s_1$, and $s_2$?
Soft Q-function:
$$Q_{\mathrm{soft}}^{\pi}(s, a) \triangleq r(s_t, a_t) + \mathbb{E}_\pi\Big[ \sum_{l=1}^{\infty} \gamma^l \big( r(s_{t+l}, a_{t+l}) + H(\pi(\cdot \mid s_{t+l})) \big) \Big]
= r(s, a) + \gamma \, p(s' \mid s, a) \, V_{\mathrm{soft}}^{\pi}(s')$$
Soft value function:
$$V_{\mathrm{soft}}^{\pi}(s) = \sum_{a} \pi(a \mid s) \big( Q_{\mathrm{soft}}^{\pi}(s, a) - \log \pi(a \mid s) \big)$$
Entropy of policy:
$$H(\pi(\cdot \mid s)) \triangleq - \sum_{a} \pi(a \mid s) \log \pi(a \mid s)$$
41
Soft Bellman Equation
Proof.
If one greedily maximizes the sum of entropy and value with a one-step look-ahead, then one obtains $\pi_{\mathrm{new}}$ from $\pi_{\mathrm{old}}$:
$$H(\pi_{\mathrm{old}}(\cdot \mid s)) + \mathbb{E}_{\pi_{\mathrm{old}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big]
\le H(\pi_{\mathrm{new}}(\cdot \mid s)) + \mathbb{E}_{\pi_{\mathrm{new}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big]$$
Example (same tree, $\pi_{\mathrm{old}}$; suppose $\gamma = 1$ and $p(s' \mid s, a) = 1$):
• at $s_1$: $Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(\mathrm{left}) = 2$, $Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(\mathrm{right}) = 4$
• at $s_2$: $Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(\mathrm{left}) = 3$, $Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(\mathrm{right}) = 1$
• at $s_0$: $Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(\mathrm{left}) = 4.3$, $Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(\mathrm{right}) = 3.3$
Soft Q-function:
$$Q_{\mathrm{soft}}^{\pi}(s, a) \triangleq r(s_t, a_t) + \mathbb{E}_\pi\Big[ \sum_{l=1}^{\infty} \gamma^l \big( r(s_{t+l}, a_{t+l}) + H(\pi(\cdot \mid s_{t+l})) \big) \Big]
= r(s, a) + \gamma \, p(s' \mid s, a) \, V_{\mathrm{soft}}^{\pi}(s')$$
Soft value function:
$$V_{\mathrm{soft}}^{\pi}(s) = \sum_{a} \pi(a \mid s) \big( Q_{\mathrm{soft}}^{\pi}(s, a) - \log \pi(a \mid s) \big)$$
Entropy of policy:
$$H(\pi(\cdot \mid s)) \triangleq - \sum_{a} \pi(a \mid s) \log \pi(a \mid s)$$
42
Soft Bellman Equation
Proof.
If one greedily maximizes the sum of entropy and value with a one-step look-ahead, then one obtains $\pi_{\mathrm{new}}$ from $\pi_{\mathrm{old}}$:
$$H(\pi_{\mathrm{old}}(\cdot \mid s)) + \mathbb{E}_{\pi_{\mathrm{old}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big]
\le H(\pi_{\mathrm{new}}(\cdot \mid s)) + \mathbb{E}_{\pi_{\mathrm{new}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big]$$
Example (same tree, with the soft Q-values of $\pi_{\mathrm{old}}$ from the previous slide: $s_1$: 2 / 4, $s_2$: 3 / 1, $s_0$: 4.3 / 3.3). A new policy $\pi_{\mathrm{new}}$:
$$\pi_{\mathrm{new}}(a \mid s) \propto \exp\big( Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big)$$
$$\pi_{\mathrm{new}}(a \mid s)
= \frac{\exp\big( Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big)}{\sum_{a'} \exp\big( Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a') \big)}
= \operatorname{softmax}_{a \in A} Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a)
= \exp\big( Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) - V_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s) \big)$$
43
Soft Bellman Equation
Proof.
If one greedily maximizes the sum of entropy and value with a one-step look-ahead, then one obtains $\pi_{\mathrm{new}}$ from $\pi_{\mathrm{old}}$:
$$H(\pi_{\mathrm{old}}(\cdot \mid s)) + \mathbb{E}_{\pi_{\mathrm{old}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big]
\le H(\pi_{\mathrm{new}}(\cdot \mid s)) + \mathbb{E}_{\pi_{\mathrm{new}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big]$$
Example (same tree): with the soft Q-values of $\pi_{\mathrm{old}}$ ($s_1$: 2 / 4, $s_2$: 3 / 1, $s_0$: 4.3 / 3.3), the new policy
$$\pi_{\mathrm{new}}(a \mid s) = \frac{\exp\big( Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big)}{\sum_{a'} \exp\big( Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a') \big)}$$
puts probability 0.73 / 0.27 on left / right at $s_0$, 0.12 / 0.88 at $s_1$, and 0.88 / 0.12 at $s_2$.
44
Soft Bellman Equation
Proof.
If one greedily maximizes the sum of entropy and value with a one-step look-ahead, then one obtains $\pi_{\mathrm{new}}$ from $\pi_{\mathrm{old}}$:
$$H(\pi_{\mathrm{old}}(\cdot \mid s)) + \mathbb{E}_{\pi_{\mathrm{old}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big]
\le H(\pi_{\mathrm{new}}(\cdot \mid s)) + \mathbb{E}_{\pi_{\mathrm{new}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big]$$
So, substitute $s_0, a_0$ for $s, a$:
$$H(\pi_{\mathrm{old}}(\cdot \mid s_0)) + \mathbb{E}_{\pi_{\mathrm{old}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s_0, a_0) \big]
\le H(\pi_{\mathrm{new}}(\cdot \mid s_0)) + \mathbb{E}_{\pi_{\mathrm{new}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s_0, a_0) \big]$$
$$0.3 + 3.8 \le 0.25 + 4.03$$
$$4.1 \le 4.28$$
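A short numerical check of this substitution on the example tree (a sketch; entropies here are computed in nats, while the slide's printed 0.3 and 0.25 appear to use a different logarithm base, so the exact figures differ although the inequality still holds):

```python
# One-step soft policy improvement check at the root of the toy tree (gamma = 1).
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log(p)))

# Soft Q-values of pi_old at the root s0 (left, right), taken from the slide: 4.3 and 3.3.
q_s0 = np.array([4.3, 3.3])
pi_old = np.array([0.5, 0.5])

# pi_new(a|s0) proportional to exp(Q_soft^{pi_old}(s0, a))  (softmax over the two actions)
logits = q_s0 - q_s0.max()
pi_new = np.exp(logits) / np.exp(logits).sum()          # ~[0.73, 0.27]

lhs = entropy(pi_old) + np.dot(pi_old, q_s0)            # H(pi_old) + E_{pi_old}[Q]
rhs = entropy(pi_new) + np.dot(pi_new, q_s0)            # H(pi_new) + E_{pi_new}[Q]
print(pi_new, lhs, rhs, lhs <= rhs)                     # inequality holds: lhs < rhs
```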
45
Soft Bellman Equation
Proof.
If one greedily maximizes the sum of entropy and value with a one-step look-ahead, then one obtains $\pi_{\mathrm{new}}$ from $\pi_{\mathrm{old}}$:
$$H(\pi_{\mathrm{old}}(\cdot \mid s)) + \mathbb{E}_{\pi_{\mathrm{old}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big]
\le H(\pi_{\mathrm{new}}(\cdot \mid s)) + \mathbb{E}_{\pi_{\mathrm{new}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big]$$
Then, we can show that:
$$\begin{aligned}
Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a)
&= \mathbb{E}_{s_1}\Big[ r_0 + \gamma \big( H(\pi_{\mathrm{old}}(\cdot \mid s_1)) + \mathbb{E}_{a_1 \sim \pi_{\mathrm{old}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s_1, a_1) \big] \big) \Big] \\
&\le \mathbb{E}_{s_1}\Big[ r_0 + \gamma \big( H(\pi_{\mathrm{new}}(\cdot \mid s_1)) + \mathbb{E}_{a_1 \sim \pi_{\mathrm{new}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s_1, a_1) \big] \big) \Big] \\
&= \mathbb{E}_{s_1, a_1 \sim \pi_{\mathrm{new}}}\Big[ r_0 + \gamma \big( H(\pi_{\mathrm{new}}(\cdot \mid s_1)) + r_1 \big) + \gamma^2 \, \mathbb{E}_{s_2}\big[ H(\pi_{\mathrm{old}}(\cdot \mid s_2)) + \mathbb{E}_{a_2 \sim \pi_{\mathrm{old}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s_2, a_2) \big] \big] \Big] \\
&\le \mathbb{E}_{s_1, a_1 \sim \pi_{\mathrm{new}}}\Big[ r_0 + \gamma \big( H(\pi_{\mathrm{new}}(\cdot \mid s_1)) + r_1 \big) + \gamma^2 \, \mathbb{E}_{s_2}\big[ H(\pi_{\mathrm{new}}(\cdot \mid s_2)) + \mathbb{E}_{a_2 \sim \pi_{\mathrm{new}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s_2, a_2) \big] \big] \Big] \\
&= \mathbb{E}_{s_1, a_1 \sim \pi_{\mathrm{new}}, s_2}\Big[ r_0 + \gamma \big( H(\pi_{\mathrm{new}}(\cdot \mid s_1)) + r_1 \big) + \gamma^2 \big( H(\pi_{\mathrm{new}}(\cdot \mid s_2)) + r_2 \big) + \gamma^3 \, \mathbb{E}_{s_3}\big[ H(\pi_{\mathrm{old}}(\cdot \mid s_3)) + \mathbb{E}_{a_3 \sim \pi_{\mathrm{old}}}\big[ Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s_3, a_3) \big] \big] \Big] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\tau \sim \pi_{\mathrm{new}}}\Big[ r_0 + \sum_{t=1}^{\infty} \gamma^t \big( H(\pi_{\mathrm{new}}(\cdot \mid s_t)) + r_t \big) \Big] \\
&= Q_{\mathrm{soft}}^{\pi_{\mathrm{new}}}(s, a)
\end{aligned}$$
Theorem 2 (Soft policy improvement theorem)
Given a policy $\pi_{\mathrm{old}}$, define a new policy $\pi_{\mathrm{new}}$ as:
$$\pi_{\mathrm{new}}(a \mid s) \propto \exp\big( Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \big)$$
Assume that throughout our computation, $Q$ is bounded and $\int \exp\big( Q(s, a) \big) \, da$ is bounded for any $s$ (for both $\pi_{\mathrm{old}}$ and $\pi_{\mathrm{new}}$).
Then,
$$Q_{\mathrm{soft}}^{\pi_{\mathrm{new}}}(s, a) \ge Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \quad \forall s, a$$
→ Monotonic policy improvement
46
Soft Bellman Equation
Comparison to standard MDP: Bellman expectation equation
Standard:
$$Q^\pi(s, a) = \mathbb{E}\big[ r(s_t, a_t) + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a \big]
= r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V^\pi(s')$$
$$V^\pi(s) = \sum_{a \in A} \pi(a \mid s) \, Q^\pi(s, a)$$
Maximum entropy:
$$Q_{\mathrm{soft}}^{\pi}(s, a) = \mathbb{E}_\pi\big[ r(s_t, a_t) + \gamma V_{\mathrm{soft}}^{\pi}(s_{t+1}) \mid s_t = s, a_t = a \big]
= r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V_{\mathrm{soft}}^{\pi}(s')$$
$$V_{\mathrm{soft}}^{\pi}(s) = \mathbb{E}_\pi\big[ Q_{\mathrm{soft}}^{\pi}(s_t, a_t) - \log \pi(a_t \mid s_t) \mid s_t = s \big]
= \log \int \exp\big( Q_{\mathrm{soft}}^{\pi}(s, a) \big) \, da$$
Maximum entropy variant:
• Add an explicit temperature parameter $\alpha$
$$V_{\mathrm{soft}}^{\pi}(s) = \mathbb{E}_\pi\big[ Q_{\mathrm{soft}}^{\pi}(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \mid s_t = s \big]
= \alpha \log \int \exp\Big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}^{\pi}(s, a) \Big) \, da$$
47
Soft Bellman Equation
Comparison to standard MDP: Bellman optimality equation
Standard:
$$Q^*(s, a) = \mathbb{E}\big[ r(s_t, a_t) + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \big]
= r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V^*(s')$$
$$V^*(s) = \max_{a \in A} Q^*(s, a)$$
Maximum entropy variant:
$$Q_{\mathrm{soft}}^{*}(s, a) = \mathbb{E}\big[ r(s_t, a_t) + \gamma V_{\mathrm{soft}}^{*}(s_{t+1}) \mid s_t = s, a_t = a \big]
= r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V_{\mathrm{soft}}^{*}(s')$$
$$V_{\mathrm{soft}}^{*}(s) = \mathbb{E}_\pi\big[ Q_{\mathrm{soft}}^{*}(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \mid s_t = s \big]
= \alpha \log \int \exp\Big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}^{*}(s, a) \Big) \, da$$
Connection:
• Note that as $\alpha \to 0$, $\exp\big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}^{*}(s, a) \big)$ emphasizes $\max_{a \in A} Q^*(s, a)$:
$$\alpha \log \int \exp\Big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}^{*}(s, a) \Big) \, da \;\to\; \max_{a \in A} Q^*(s, a) \quad \text{as } \alpha \to 0$$
48
Soft Bellman Equation
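A quick numeric illustration of this limit with a discrete action set and made-up Q-values (a sketch, not from the slides):

```python
# The soft maximum alpha * log(sum_a exp(Q/alpha)) approaches the hard maximum as alpha -> 0.
import numpy as np

q = np.array([1.0, 2.0, 3.0])                      # made-up Q-values for one state

def soft_max(q, alpha):
    z = q / alpha
    z_max = z.max()                                 # stabilize the log-sum-exp
    return alpha * (z_max + np.log(np.sum(np.exp(z - z_max))))

for alpha in [10.0, 1.0, 0.1, 0.01]:
    print(alpha, soft_max(q, alpha))                # ~13.0, ~3.41, ~3.00, ~3.00 (-> max Q = 3)
```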
Comparison to standard MDP: Policy
Standard:
$$\pi_{\mathrm{new}}(a \mid s) \in \arg\max_{a} Q^{\pi_{\mathrm{old}}}(s, a)$$
→ Deterministic policy
Maximum entropy:
$$\pi_{\mathrm{new}}(a \mid s)
= \frac{\exp\Big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) \Big)}{\int \exp\Big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a') \Big) \, da'}
= \exp\Big( \tfrac{1}{\alpha} \big( Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) - V_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s) \big) \Big)$$
→ Stochastic policy
49
Soft Bellman Equation
So, what’s the advantage of maximum entropy?
• Computation:
No maximization involved
• Exploration:
High entropy policy
• Structural similarity:
Can combine it with many RL methods for standard MDP
50
Soft Bellman Equation
Idea:
• Maximum entropy + off-policy actor-critic
Advantages:
• Exploration
• Sample efficiency
• Stable convergence
• Little hyperparameter tuning
51
From Soft Policy Iteration to Soft Actor-Critic
Version 1 (Jan 4 2018) Version 2 (Dec 13 2018)
Soft policy evaluation
Modified Bellman operator:
$$T^\pi Q(s, a) \triangleq r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V_{\mathrm{soft}}^{\pi}(s')$$
where $V_{\mathrm{soft}}^{\pi}(s) = \mathbb{E}_\pi\big[ Q_{\mathrm{soft}}^{\pi}(s, a) - \alpha \log \pi(a \mid s) \big]$
Repeatedly apply $T^\pi$ to $Q$:
$$Q_{k+1} \leftarrow T^\pi Q_k$$
• Result: $Q_k$ converges to $Q^\pi$!
52
Soft Policy Iteration
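A minimal tabular sketch of soft policy evaluation (the toy MDP, policy, and constants below are made up): repeatedly applying the modified Bellman operator drives Q to its fixed point.

```python
# Tabular soft policy evaluation: Q <- r + gamma * sum_s' p(s'|s,a) * V_soft(s')  (sketch).
import numpy as np

n_s, n_a, gamma, alpha = 2, 2, 0.9, 0.2
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])        # toy transition probabilities p[s, a, s']
r = np.array([[1.0, 0.0], [0.0, 2.0]])          # toy rewards r[s, a]
pi = np.full((n_s, n_a), 0.5)                   # fixed stochastic policy to evaluate

Q = np.zeros((n_s, n_a))
for _ in range(2000):
    # V_soft(s) = E_{a~pi}[ Q(s,a) - alpha * log pi(a|s) ]
    V = np.sum(pi * (Q - alpha * np.log(pi)), axis=1)
    Q_new = r + gamma * p @ V                   # modified Bellman operator T^pi
    if np.max(np.abs(Q_new - Q)) < 1e-8:
        break
    Q = Q_new
print(Q)                                        # fixed point: the soft Q-function of pi
```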
Soft policy improvement
If there is no constraint on policies,
$$\pi_{\mathrm{new}}(a \mid s) = \exp\Big( \tfrac{1}{\alpha} \big( Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, a) - V_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s) \big) \Big)$$
When the constraint needs to be satisfied (i.e., $\pi \in \Pi$), use the information projection onto the target policy:
$$\pi_{\mathrm{new}}(\cdot \mid s) \in \arg\min_{\pi'(\cdot \mid s) \in \Pi} D_{\mathrm{KL}}\Big( \pi'(\cdot \mid s) \,\Big\|\, \exp\Big( \tfrac{1}{\alpha} \big( Q_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s, \cdot) - V_{\mathrm{soft}}^{\pi_{\mathrm{old}}}(s) \big) \Big) \Big)$$
• Result: monotonic improvement of the policy! ($\pi_{\mathrm{new}}$ is better than $\pi_{\mathrm{old}}$)
53
Soft Policy Iteration
Model-Free version of soft policy iteration: Soft actor-critic
Soft policy iteration: maximum entropy variant of policy iteration
Soft actor-critic (SAC): maximum entropy variant of actor-critic
• Soft critic: evaluates the soft Q-function $Q_\theta$ of the policy $\pi_\phi$
• Soft actor: improves the maximum entropy policy using the critic's evaluation
• Temperature parameter $\alpha$: automatically adjusts the entropy weight of the maximum entropy policy
54
Soft Actor-Critic
Soft critic: evaluates the soft Q-function $Q_\theta$ of the policy $\pi_\phi$
Idea: DQN + maximum entropy
• Training the soft Q-function (the term in the inner parentheses is the target):
$$\min_\theta J_Q(\theta) \triangleq \mathbb{E}_{(s_t, a_t)}\Big[ \tfrac{1}{2} \Big( Q_\theta(s_t, a_t) - \big( r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}}\big[ V_{\bar\theta}(s_{t+1}) \big] \big) \Big)^2 \Big]$$
• Stochastic gradient, using a sample estimate of $\mathbb{E}_{s_{t+1}}\big[ V_{\bar\theta}(s_{t+1}) \big]$:
$$\hat\nabla_\theta J_Q(\theta) = \nabla_\theta Q_\theta(s_t, a_t) \Big( Q_\theta(s_t, a_t) - \big( r(s_t, a_t) + \gamma \big( Q_{\bar\theta}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1}) \big) \big) \Big)$$
55
Soft Actor-Critic
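A sketch of how the critic target and loss are formed for a single transition, with toy stand-in functions for $Q_\theta$, the target network $Q_{\bar\theta}$, and the policy (all names and numbers are illustrative assumptions, not the authors' implementation):

```python
# One-transition soft critic target and loss (illustrative sketch with toy stand-ins).
import numpy as np

alpha, gamma = 0.2, 0.99

def q_net(s, a):           # stand-in for Q_theta(s, a)
    return 0.5 * s + 0.1 * a

def q_target_net(s, a):    # stand-in for Q_theta_bar(s, a) (slowly updated copy)
    return 0.5 * s + 0.1 * a

def sample_policy(s):      # stand-in for a ~ pi_phi(.|s); returns (action, log-prob)
    a = np.tanh(0.3 * s)
    log_prob = -1.0        # placeholder value for log pi_phi(a|s)
    return a, log_prob

# One transition (s_t, a_t, r_t, s_{t+1}) from the replay buffer (made-up numbers)
s_t, a_t, r_t, s_tp1 = 0.4, 0.2, 1.0, 0.6

# Sample a_{t+1} ~ pi_phi(.|s_{t+1}) to form the sample estimate of V_theta_bar(s_{t+1})
a_tp1, logp_tp1 = sample_policy(s_tp1)
v_tp1 = q_target_net(s_tp1, a_tp1) - alpha * logp_tp1   # soft state value estimate
target = r_t + gamma * v_tp1                             # TD target (treated as a constant)

td_error = q_net(s_t, a_t) - target
loss = 0.5 * td_error ** 2                               # J_Q(theta) for this sample
print(target, td_error, loss)
```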
Soft actor: updates the policy using the critic's evaluation
Idea: policy gradient + maximum entropy
• Minimize the expected KL-divergence to the target policy:
$$\min_\phi D_{\mathrm{KL}}\Big( \pi_\phi(\cdot \mid s_t) \,\Big\|\, \exp\Big( \tfrac{1}{\alpha} \big( Q_\theta(s_t, \cdot) - V_\theta(s_t) \big) \Big) \Big)$$
which is equivalent to
$$\min_\phi J_\pi(\phi) \triangleq \mathbb{E}_{s_t}\Big[ \mathbb{E}_{a_t \sim \pi_\phi(\cdot \mid s_t)}\big[ \alpha \log \pi_\phi(a_t \mid s_t) - Q_\theta(s_t, a_t) \big] \Big]$$
56
Soft Actor-Critic
Reparameterization trick
Reparameterize the policy as
$$a_t \triangleq f_\phi(\epsilon_t; s_t)$$
where $\epsilon_t$ is an input noise vector drawn from some fixed distribution (e.g., a Gaussian)
Benefit: lower variance
Rewrite the policy optimization problem as
$$\min_\phi J_\pi(\phi) \triangleq \mathbb{E}_{s_t, \epsilon_t}\big[ \alpha \log \pi_\phi\big( f_\phi(\epsilon_t; s_t) \mid s_t \big) - Q_\theta\big( s_t, f_\phi(\epsilon_t; s_t) \big) \big]$$
Stochastic gradient:
$$\hat\nabla_\phi J_\pi(\phi) = \nabla_\phi \, \alpha \log \pi_\phi(a_t \mid s_t) + \big( \nabla_{a_t} \alpha \log \pi_\phi(a_t \mid s_t) - \nabla_{a_t} Q_\theta(s_t, a_t) \big) \nabla_\phi f_\phi(\epsilon_t; s_t)$$
where $a_t$ is evaluated at $f_\phi(\epsilon_t; s_t)$
57
Soft Actor-Critic
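A sketch of the reparameterization for a tanh-squashed diagonal Gaussian policy, the parameterization used in the SAC papers; the mean/std values below are made up, and the log-density correction follows the usual change-of-variables formula:

```python
# Reparameterized sample a = f_phi(eps; s) = tanh(mu(s) + sigma(s) * eps) and its log-prob (sketch).
import numpy as np

rng = np.random.default_rng(0)

def gaussian_log_prob(u, mu, sigma):
    return float(np.sum(-0.5 * ((u - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)))

# Hypothetical policy-network outputs for one state (in practice produced by pi_phi(s))
mu, sigma = np.array([0.3, -0.1]), np.array([0.5, 0.2])

eps = rng.standard_normal(2)          # fixed-distribution input noise
u = mu + sigma * eps                  # pre-squash Gaussian sample (differentiable in mu, sigma)
a = np.tanh(u)                        # squash to a bounded action space

# log pi(a|s) = log N(u; mu, sigma) - sum_i log(1 - tanh(u_i)^2)   (change of variables)
log_pi = gaussian_log_prob(u, mu, sigma) - float(np.sum(np.log(1 - np.tanh(u) ** 2 + 1e-6)))
print(a, log_pi)
```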
Automating entropy adjustment
Why do we need automatic adjustment of the temperature parameter $\alpha$?
• Choosing the optimal temperature $\alpha_t^*$ is non-trivial, and the temperature needs to be tuned for each task.
• Since the entropy $H(\pi_t)$ can vary unpredictably both across tasks and during training as the policy improves, tuning the temperature is particularly difficult.
• Thus, simply forcing the entropy to a fixed value is a poor solution, since the policy should be free to explore more in regions where the optimal action is uncertain and to remain more deterministic elsewhere.
58
Soft Actor-Critic
Automating entropy adjustment
Constrained optimization problem (suppose $\gamma = 1$):
$$\max_{\pi_0, \dots, \pi_T} \; \mathbb{E}\Big[ \sum_{t=0}^{T} r(s_t, a_t) \Big] \quad \text{s.t.} \quad H(\pi_t) \ge H_0 \;\; \forall t$$
where $H_0$ is a desired minimum expected entropy threshold.
Rewrite the objective as an iterated maximization, employing an (approximate) dynamic programming approach:
$$\max_{\pi_0} \mathbb{E}\Big[ r(s_0, a_0) + \max_{\pi_1} \mathbb{E}\big[ \cdots + \max_{\pi_T} \mathbb{E}[ r(s_T, a_T) ] \big] \Big]$$
subject to the constraint on entropy.
Start from the last time step $T$:
$$\max_{\pi_T} \; \mathbb{E}_{(s_T, a_T) \sim \rho_\pi}\big[ r(s_T, a_T) \big]
\quad \text{subject to} \quad \mathbb{E}_{(s_T, a_T) \sim \rho_\pi}\big[ -\log \pi_T(a_T \mid s_T) \big] - H_0 \ge 0$$
59
Soft Actor-Critic
Automating entropy adjustment
Change the constrained maximization to the dual problem using the Lagrangian:
1. Create the following function:
$$f(\pi_T) = \begin{cases} \mathbb{E}_{(s_T, a_T) \sim \rho_\pi}\big[ r(s_T, a_T) \big], & \text{if } h(\pi_T) \ge 0 \\ -\infty, & \text{otherwise} \end{cases}$$
$$h(\pi_T) = H(\pi_T) - H_0 = \mathbb{E}_{(s_T, a_T) \sim \rho_\pi}\big[ -\log \pi_T(a_T \mid s_T) \big] - H_0$$
2. So, rewrite the objective as
$$\max_{\pi_T} f(\pi_T) \quad \text{s.t.} \quad h(\pi_T) \ge 0$$
3. Introduce a Lagrange multiplier $\alpha_T \ge 0$ (the dual variable) and form the Lagrangian:
$$L(\pi_T, \alpha_T) = f(\pi_T) + \alpha_T h(\pi_T)$$
4. Rewrite the objective using the Lagrangian as
$$\max_{\pi_T} f(\pi_T) = \min_{\alpha_T \ge 0} \max_{\pi_T} L(\pi_T, \alpha_T) = \min_{\alpha_T \ge 0} \max_{\pi_T} \big( f(\pi_T) + \alpha_T h(\pi_T) \big)$$
60
Soft Actor-Critic
Automating entropy adjustment
Change the constrained maximization to the dual problem using the Lagrangian:
$$\begin{aligned}
\max_{\pi_T} \mathbb{E}_{(s_T, a_T) \sim \rho_\pi}\big[ r(s_T, a_T) \big]
&= \max_{\pi_T} f(\pi_T) \\
&= \min_{\alpha_T \ge 0} \max_{\pi_T} L(\pi_T, \alpha_T) = \min_{\alpha_T \ge 0} \max_{\pi_T} \big( f(\pi_T) + \alpha_T h(\pi_T) \big) \\
&= \min_{\alpha_T \ge 0} \max_{\pi_T} \Big( \mathbb{E}_{(s_T, a_T) \sim \rho_\pi}\big[ r(s_T, a_T) \big] + \alpha_T \big( \mathbb{E}_{(s_T, a_T) \sim \rho_\pi}\big[ -\log \pi_T(a_T \mid s_T) \big] - H_0 \big) \Big) \\
&= \min_{\alpha_T \ge 0} \max_{\pi_T} \Big( \mathbb{E}_{(s_T, a_T) \sim \rho_\pi}\big[ r(s_T, a_T) - \alpha_T \log \pi_T(a_T \mid s_T) \big] - \alpha_T H_0 \Big) \\
&= \min_{\alpha_T \ge 0} \max_{\pi_T} \Big( \mathbb{E}_{(s_T, a_T) \sim \rho_\pi}\big[ r(s_T, a_T) \big] + \alpha_T H(\pi_T) - \alpha_T H_0 \Big)
\end{aligned}$$
We can solve for
• the optimal policy $\pi_T^*$ as
$$\pi_T^* = \arg\max_{\pi_T} \; \mathbb{E}_{(s_T, a_T) \sim \rho_\pi}\big[ r(s_T, a_T) \big] + \alpha_T H(\pi_T) - \alpha_T H_0$$
• the optimal dual variable $\alpha_T^*$ as
$$\alpha_T^* = \arg\min_{\alpha_T \ge 0} \; \mathbb{E}_{(s_T, a_T) \sim \rho_{\pi^*}}\big[ \alpha_T H(\pi_T^*) - \alpha_T H_0 \big]$$
Thus,
$$\max_{\pi_T} \mathbb{E}\big[ r(s_T, a_T) \big] = \mathbb{E}_{(s_T, a_T) \sim \rho_{\pi^*}}\big[ r(s_T, a_T) \big] + \alpha_T^* H(\pi_T^*) - \alpha_T^* H_0$$
61
Soft Actor-Critic
Automating entropy adjustment
Then, the soft Q-function can be computed as
$$Q_{T-1}(s_{T-1}, a_{T-1}) = r(s_{T-1}, a_{T-1}) + \mathbb{E}\big[ Q_T(s_T, a_T) - \alpha_T \log \pi_T(a_T \mid s_T) \big]
= r(s_{T-1}, a_{T-1}) + \mathbb{E}\big[ r(s_T, a_T) \big] + \alpha_T H(\pi_T)$$
$$Q_{T-1}^*(s_{T-1}, a_{T-1}) = r(s_{T-1}, a_{T-1}) + \max_{\pi_T} \mathbb{E}\big[ r(s_T, a_T) \big] + \alpha_T^* H(\pi_T^*)$$
Now, at time step $T - 1$, we have:
$$\begin{aligned}
\max_{\pi_{T-1}} \Big( \mathbb{E}\big[ r(s_{T-1}, a_{T-1}) \big] + \max_{\pi_T} \mathbb{E}\big[ r(s_T, a_T) \big] \Big)
&= \max_{\pi_{T-1}} \Big( \mathbb{E}\big[ Q_{T-1}^*(s_{T-1}, a_{T-1}) \big] - \alpha_T^* H(\pi_T^*) \Big) \\
&= \min_{\alpha_{T-1} \ge 0} \max_{\pi_{T-1}} \Big( \mathbb{E}\big[ Q_{T-1}^*(s_{T-1}, a_{T-1}) \big] - \alpha_T^* H(\pi_T^*) + \alpha_{T-1}\big( H(\pi_{T-1}) - H_0 \big) \Big) \\
&= \min_{\alpha_{T-1} \ge 0} \max_{\pi_{T-1}} \Big( \mathbb{E}\big[ Q_{T-1}^*(s_{T-1}, a_{T-1}) \big] + \alpha_{T-1} H(\pi_{T-1}) - \alpha_{T-1} H_0 - \alpha_T^* H(\pi_T^*) \Big)
\end{aligned}$$
62
Soft Actor-Critic
Automating entropy adjustment
We can solve for
• the optimal policy $\pi_{T-1}^*$ as
$$\pi_{T-1}^* = \arg\max_{\pi_{T-1}} \; \mathbb{E}_{(s_{T-1}, a_{T-1}) \sim \rho_\pi}\big[ Q_{T-1}^*(s_{T-1}, a_{T-1}) \big] + \alpha_{T-1} H(\pi_{T-1}) - \alpha_{T-1} H_0$$
• the optimal dual variable $\alpha_{T-1}^*$ as
$$\alpha_{T-1}^* = \arg\min_{\alpha_{T-1} \ge 0} \; \mathbb{E}_{(s_{T-1}, a_{T-1}) \sim \rho_{\pi^*}}\big[ \alpha_{T-1} H(\pi_{T-1}^*) - \alpha_{T-1} H_0 \big]$$
In this way, we can proceed backwards in time and recursively solve the constrained optimization problem. At each step we obtain the optimal dual variable $\alpha_t^*$ after solving for $Q_t^*$ and $\pi_t^*$:
$$\alpha_t^* = \arg\min_{\alpha_t} \; \mathbb{E}_{a_t \sim \pi_t^*}\big[ -\alpha_t \log \pi_t^*(a_t \mid s_t) - \alpha_t H_0 \big]$$
63
Soft Actor-Critic
Temperature parameter $\alpha$: automatically adjusts the entropy weight of the maximum entropy policy
Idea: constrained optimization problem + dual problem
• Minimize the dual objective by approximate dual gradient descent:
$$\min_\alpha \; \mathbb{E}_{a_t \sim \pi_t}\big[ -\alpha \log \pi_t(a_t \mid s_t) - \alpha H_0 \big]$$
64
Soft Actor-Critic
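A minimal sketch of the resulting temperature update: a gradient step on the dual objective using the log-probabilities of freshly sampled actions (the batch values and the target-entropy heuristic below are illustrative assumptions):

```python
# Approximate dual gradient descent on alpha: J(alpha) = E[-alpha * (log pi(a|s) + H0)]  (sketch).
import numpy as np

target_entropy = -2.0                 # H0; -dim(action space) is a commonly used heuristic
log_alpha = np.log(0.2)               # optimize log(alpha) so that alpha stays positive
lr = 3e-4

# Hypothetical log pi_phi(a_t|s_t) values for a sampled minibatch of actions
log_probs = np.array([-1.1, -0.7, -1.5, -0.9])

# dJ/d(log_alpha) = -alpha * E[log pi + H0]; the step moves alpha to keep entropy near H0
grad = -np.exp(log_alpha) * float(np.mean(log_probs + target_entropy))
log_alpha -= lr * grad
alpha = float(np.exp(log_alpha))
print(alpha)
```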
Putting everything together: Soft Actor-Critic (SAC)
65
Soft Actor-Critic
66
Results
• Soft actor-critic (blue and yellow) performs consistently across all tasks, outperforming both on-policy and off-policy methods on the most challenging tasks.
67
References
Papers
• T. Haarnoja, et al., “Reinforcement Learning with Deep Energy-Based Policies”, ICML 2017
• T. Haarnoja, et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”, ICML 2018
• T. Haarnoja, et al., “Soft Actor-Critic Algorithms and Applications”, arXiv preprint 2018
Theoretical analysis
• “Induction of Q-learning, Soft Q-learning, and Sparse Q-learning” written by Sungjoon Choi
• “Reinforcement Learning with Deep Energy-Based Policies” reviewed by Kyungjae Lee
• “Entropy Regularization in Reinforcement Learning (Stochastic Policy)” reviewed by Kyungjae Lee
• Reframing Control as an Inference Problem, CS 294-112: Deep Reinforcement Learning
• Constrained optimization problem and dual problem for SAC with automatically adjusted
temperature (in Korean)
68
Any Questions?
69
Thank You
