Thom Lane
14th December 2018
Proximal Policy Optimization
OpenAI, 2017
Policy Approximation
Learn actions. Optionally values too.
i.e. learn the policy directly,
rather than indirectly via values (e.g. Q-values).
Actor Critic methods learn both.
Works for discrete and continuous action spaces.
Can learn probabilities (with a softmax output)
for discrete actions.
Can learn distribution parameters (mean and std. dev.)
for continuous actions.
Can be easier to learn in some cases (e.g. Tetris).
Stochastic Policies
Discrete Action Spaces
Often use Categorical Distribution.
Continuous Action Spaces
Often use Diagonal Gaussian Distribution.
Standard deviation can depend on state.
Use log to remove >0 constraint.
Can learn exploration directly in the policy
Smooths learning
Stochastic Policies can be optimal
Useful for Continuous Control
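A minimal NumPy sketch of the two distribution choices above. The logits, mean and log-std values are placeholders standing in for network outputs, not taken from any real model; note how exponentiating the unconstrained log-std removes the >0 constraint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete action space: a categorical policy from softmax logits.
def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical network output
probs = softmax(logits)
discrete_action = rng.choice(len(probs), p=probs)

# Continuous action space: a diagonal Gaussian policy.
# The network outputs the mean and the *log* of the standard deviation;
# any real-valued log_std maps to a strictly positive std via exp().
mean = np.array([0.5, -0.2])         # hypothetical network output
log_std = np.array([-1.0, 0.0])      # unconstrained
std = np.exp(log_std)                # always > 0
continuous_action = mean + std * rng.standard_normal(mean.shape)
```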
Policy Gradients
J(θ) = v_{π_θ}(s₀)

∇J(θ) ∝ Σ_s μ(s) Σ_a q_π(s, a) ∇π_θ(a|s)

∇J(θ) = E_π[ Σ_a q_π(S_t, a) ∇π_θ(a|S_t) ]

∇J(θ) = E_π[ q_π(S_t, A_t) ∇π_θ(A_t|S_t) / π_θ(A_t|S_t) ]

∇J(θ) = E_π[ G_t ∇π_θ(A_t|S_t) / π_θ(A_t|S_t) ]

∇J(θ) ≈ G_t ∇π_θ(A_t|S_t) / π_θ(A_t|S_t)   (single-sample estimate)
Policy Gradients
J(θ) = v_{π_θ}(s₀)

∇J(θ) ≈ G_t ∇π_θ(A_t|S_t) / π_θ(A_t|S_t)

Gradient Ascent:

θ_{t+1} ← θ_t + α ∇J(θ)

θ_{t+1} ← θ_t + α G_t ∇π_θ(A_t|S_t) / π_θ(A_t|S_t)

∇J(θ) ≈ G_t ∇log π_θ(A_t|S_t), because ∇log x = ∇x / x
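A minimal NumPy sketch of this update on a toy two-action softmax policy. The parameters, sampled action, return and learning rate are made-up values, just to show the θ ← θ + α G_t ∇log π update moving probability toward the rewarded action.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.0, 0.0])   # hypothetical policy parameters (logits)
alpha = 0.1                    # learning rate
action, G = 0, 2.0             # sampled action and its return

# Gradient of log pi(action) w.r.t. softmax logits: one_hot(action) - probs
probs = softmax(theta)
grad_log_pi = np.eye(2)[action] - probs

# theta <- theta + alpha * G_t * grad log pi(A_t|S_t)
theta = theta + alpha * G * grad_log_pi
```

After one step the probability of the rewarded action rises above its initial 50%.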
Policy Gradients

θ_{t+1} ← θ_t + α G_t ∇π_θ(A_t|S_t) / π_θ(A_t|S_t)

[Figure: successive update steps plotted in parameter space, axes θ₁ and θ₂]

Worked example, step 1:
Action Space: left and right
State: A
Sampled action: left
Return: +2
Probability of action: 75%
Update direction: (2 / 0.75) ∇π_θ(left|s)

Worked example, step 2:
Action Space: left and right
State: B
Sampled action: right
Return: -1
Probability of action: 50%
Update direction: (-1 / 0.5) ∇π_θ(right|s)
Problem #1: Updates affect trajectories
Start with ‘on-policy’ learning.
We calculate gradient using trajectories collected from the current policy.
After a single update, we’re already ‘off-policy’.
We calculate gradient using trajectories collected from the previous policy.
Gradient calculation is only valid for ‘on-policy’ trajectories.
Can’t keep updating without collecting new trajectories.
Similar to ‘overfitting’:
We might see the loss fall if we keep training on old data, but…
it’s very likely to perform badly in the next rollouts. So ignore the loss!
Solution #1: Importance Sampling
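Importance sampling lets us keep using trajectories collected under the previous policy: each sample is reweighted by the ratio of new to old action probability, so the expectation is taken under the new policy. A toy NumPy sketch under made-up policies and returns:

```python
import numpy as np

pi_old = np.array([0.5, 0.5])   # behaviour policy (collected the data)
pi_new = np.array([0.8, 0.2])   # current policy we want to evaluate
returns = np.array([1.0, 3.0])  # hypothetical return for each action

# On-policy expectation under pi_new (ground truth here):
true_value = (pi_new * returns).sum()   # 0.8*1.0 + 0.2*3.0 = 1.4

# Off-policy estimate: sample actions from pi_old, then weight each
# sample by the ratio pi_new(a) / pi_old(a).
rng = np.random.default_rng(0)
actions = rng.choice(2, size=100_000, p=pi_old)
weights = pi_new[actions] / pi_old[actions]
estimate = (weights * returns[actions]).mean()
```

The reweighted off-policy estimate converges to the on-policy value of 1.4.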
Problem #2: Similar values
Actions usually have similar expected returns from a given state.
A convoluted path is taken through parameter space using this gradient.
[Figure: parameter space, axes θ₁ and θ₂, with updates for actions a₁, a₂ and a₃]
Action Space: a₁, a₂ and a₃
State: C
Sampled Actions*: a₁, a₂ and a₃
Returns: 0.8, 1.0 and 1.2
Probability of action: equal
* assume we’ve seen State C 3 times in the current trajectories.
Solution #2: Advantage
Shift from absolute to relative: compared to what was expected.

A_π(s, a) = q_π(s, a) − v_π(s)

[Figure: parameter space, axes θ₁ and θ₂, with updates for actions a₁, a₂ and a₃]
Action Space: a₁, a₂ and a₃
State: C
Sampled Actions*: a₁, a₂ and a₃
Returns: 0.8, 1.0 and 1.2
Probability of action: equal (i.e. 1/3)
Estimated State Value: 1.1
Advantages: -0.3, -0.1 and 0.1
* assume we’ve seen State C 3 times in the current trajectories.
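The slide’s numbers can be checked directly: subtracting the critic’s state-value estimate from each return gives the advantages, centring the update around “better or worse than expected”.

```python
import numpy as np

# From the slide: three actions sampled in State C.
returns = np.array([0.8, 1.0, 1.2])
state_value = 1.1                      # estimated v(C)
advantages = returns - state_value     # A = q - v, with returns estimating q
# -> [-0.3, -0.1, 0.1]: only the best action gets a positive update.
```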
Actor Critic Methods

A_π(s, a) = q_π(s, a) − v_π(s)

Advantage Actor Critic (A2C)
1) Use v as a baseline
2) Use v to bootstrap q with 1-step TD

θ_{t+1} ← θ_t + α A_t ∇π_θ(A_t|S_t) / π_θ(A_t|S_t)

w_{t+1} ← w_t + α [R_t + γ v_w(S′) − v_w(S)] ∇v_w(S)

[Diagram: the Actor maps an observation to action probabilities; the Critic maps it to a state value]
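The two update rules above can be sketched for a single transition (s, a, r, s′). The linear critic v_w(s) = w·s and linear-softmax actor are hypothetical choices made purely to keep the sketch self-contained; the TD error serves as the advantage estimate A_t.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

gamma, alpha = 0.9, 0.1
s = np.array([1.0, 0.0])        # current state features
s_next = np.array([0.0, 1.0])   # next state features
a, r = 0, 1.0                   # sampled action and reward
w = np.zeros(2)                 # critic parameters
theta = np.zeros((2, 2))        # actor parameters, one row per action

# Critic: 1-step TD error, using v_w to bootstrap the target.
td_error = r + gamma * (w @ s_next) - (w @ s)
w = w + alpha * td_error * s    # grad of v_w(s) = w.s w.r.t. w is s

# Actor: use the TD error as the advantage estimate A_t.
probs = softmax(theta @ s)
grad_log_pi = np.outer(np.eye(2)[a] - probs, s)   # grad of log pi(a|s)
theta = theta + alpha * td_error * grad_log_pi
```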
Trust Region Policy Optimization (TRPO)

max_θ  Ê_t[ r_t(θ) Â_t ]

subject to  Ê_t[ KL[π_θ_old(·|s_t), π_θ(·|s_t)] ] ≤ δ

r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)
Proximal Policy Optimization (PPO)

r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)

max_θ  Ê_t[ r_t(θ) Â_t ]

max_θ  Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]
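The clipped surrogate is easy to state directly in NumPy. The example values of r_t and Â_t are made up, but they show the key property: with a positive advantage, pushing the ratio past 1 + ε gains nothing, while with a negative advantage a large ratio is still fully penalised.

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Mean of min(r*A, clip(r, 1-eps, 1+eps)*A) over a batch."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped).mean()

ratio = np.array([0.5, 1.0, 1.5])   # pi_new / pi_old per timestep
adv = np.array([1.0, 1.0, 1.0])
objective = ppo_clip_objective(ratio, adv)
# Per-element terms: 0.5, 1.0, 1.2 (the 1.5 ratio is clipped to 1.2).
```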
Proximal Policy Optimization (PPO)
Multiple losses in PPO
L_t^CLIP(θ) − c₁ L_t^VF(θ) + c₂ S[π_θ](s_t)
Clipped Policy Gradient
Value Error
Entropy
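The three terms combine into a single objective to maximise. The coefficients c1 and c2 and the input numbers below are illustrative placeholders, with the value error taken as a mean squared error, which is a common but here assumed choice.

```python
import numpy as np

def ppo_loss(clip_term, value_pred, value_target, entropy, c1=0.5, c2=0.01):
    """Clipped policy term, minus weighted value error, plus entropy bonus."""
    value_error = np.mean((value_pred - value_target) ** 2)  # L^VF
    return clip_term - c1 * value_error + c2 * entropy

total = ppo_loss(clip_term=0.9,
                 value_pred=np.array([1.0, 2.0]),
                 value_target=np.array([1.0, 1.0]),
                 entropy=0.69)
# 0.9 - 0.5 * 0.5 + 0.01 * 0.69
```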
Advantages
Sample efficient
A little bit: we can take multiple update steps on a single batch,
which would usually break on-policy rules.
Stable: due to clipping.
Easy to implement.
Thanks!
