A review of the basic ideas and concepts in reinforcement learning, including the Q-Learning and Sarsa methods, followed by a survey of modern RL methods, including Dyna-Q, DQN, REINFORCE, and A2C, and how they relate.
2. What to expect from this talk
Part 1: Introduce the foundations of reinforcement learning
● Definitions and basic ideas
● A couple of algorithms that work in simple environments
Part 2: Review some state-of-the-art methods
● Higher-level concepts, vanilla methods
● Not a complete list of cutting-edge methods
Part 3: Current state of reinforcement learning
4. What is reinforcement learning?
A type of machine learning where an agent interacts with an environment and learns to take actions that result in greater cumulative reward.
Unsupervised Learning: X alone is analyzed for patterns
● PCA
● Cluster analysis
● Outlier detection
Supervised Learning: X is used to predict Y
● Classification
● Regression
Reinforcement Learning: an agent learns from interaction with an environment
5. Definitions
Agent
The learner and decision maker.
Environment
Everything external to the agent that is used to make decisions.
Actions
The set of possible steps the agent can take, depending on the state of the environment.
Reward
Motivation for the agent. It is not always obvious what the reward signal should be, e.g. YOU WIN! +1, GAME OVER -1, stay alive +1/second (sort of).
6. The Problem with Rewards...
Designing reward functions is notoriously difficult (Clark, Jack. “Faulty Reward Functions in the Wild.” OpenAI, 21 Dec. 2016).
[Figure: a racing game played by a human versus a reinforcement learning agent. Possible reward structures: total points, time to finish, finishing position.]
“I’ve taken to imagining deep RL as a demon that’s deliberately misinterpreting your reward and actively searching for the laziest possible local optima.”¹
- Alex Irpan
1 Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.
7. More Definitions
Return
Long-term, discounted reward.
Value
Expected return.
value of states → V(s): how good is it to be in state s
value of state-action pairs → Q(s,a): how good is it to take action a from state s
Discount factor
γ, which weights future rewards relative to immediate ones (and prevents infinite returns).
Policy
How the agent should act from a given state → π(a|s)
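The formulas behind these definitions were images on the slide and did not survive extraction; in standard Sutton & Barto notation they are:

```latex
% Return: discounted sum of future rewards (0 <= gamma <= 1).
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
% Values are expected returns:
V(s)   = \mathbb{E}[\, G_t \mid S_t = s \,]
Q(s,a) = \mathbb{E}[\, G_t \mid S_t = s, A_t = a \,]
```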
8. Markov Decision Process
Markov Process
A random process whose future behavior only depends on the current state.
[Diagram: Robodog's three states (Sleepy, Energetic, Hungry) with transition probabilities between them.]
9. Markov Decision Process
Markov Process + Actions + Reward = Markov Decision Process
[Diagram: the same three states (Sleepy, Energetic, Hungry), now with actions (nap, beg, be good), transition probabilities, and rewards ranging from -6 to +10 attached to each transition.]
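As a minimal sketch, the diagram's structure can be written as a transition table in Python. The probabilities and rewards below are illustrative placeholders, since the diagram's exact values do not survive extraction:

```python
# Minimal sketch of the Robodog MDP as a transition table.
# Probabilities and rewards are illustrative placeholders, not the
# exact values from the slide's diagram.
# mdp[state][action] = list of (probability, next_state, reward)
mdp = {
    "sleepy": {
        "nap": [(0.7, "energetic", +2), (0.3, "sleepy", -1)],
    },
    "energetic": {
        "beg":     [(0.5, "hungry", +10), (0.5, "energetic", -1)],
        "be good": [(0.6, "hungry", +5),  (0.4, "energetic", -1)],
    },
    "hungry": {
        "beg":     [(0.4, "sleepy", -6), (0.6, "hungry", -2)],
        "be good": [(0.6, "sleepy", +7), (0.4, "hungry", -2)],
    },
}

# Sanity check: outgoing probabilities for each (state, action) sum to 1.
for state, actions in mdp.items():
    for action, transitions in actions.items():
        assert abs(sum(p for p, _, _ in transitions) - 1.0) < 1e-9
```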
10. To model or not to model
Model-based methods
● Transition model (planning): we already know the dynamics of the environment; we simply need to plan our actions to optimize return.
● Sample model (planning and learning): we don't know the dynamics; we try to learn them by exploring the environment, then use them to plan our actions to optimize return.
Model-free methods (learning)
● We don't know or care about the dynamics; we just want to learn a good policy by exploring the environment.
12. Bellman Equations
Value of each state under the optimal policy for Robodog:
Bellman Equation (the slide annotates its parts: policy, transition probabilities, reward, discount factor, value of the current state, value of the next state):
V_π(s) = Σ_a π(a|s) Σ_s′,r p(s′,r|s,a) [ r + γ·V_π(s′) ]
Bellman Optimality Equation:
V_*(s) = max_a Σ_s′,r p(s′,r|s,a) [ r + γ·V_*(s′) ]
13. Policy Iteration
Policy evaluation
Makes the value function consistent with the current policy.
Policy improvement
Makes the policy greedy with respect to the current value function.
[Diagram: Robodog's policy before and after one improvement step. The initial policy always naps when sleepy and splits 50%/50% between beg and be good when energetic or hungry; the improved policy is deterministic (100%/0%) in every state.]

state      value (initial policy)
sleepy     19.88
energetic  20.97
hungry     20.63

state      value (improved policy)
sleepy     29.66
energetic  31.66
hungry     31.90

Alternating the two steps converges to the optimal policy and the value function under the optimal policy.
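A minimal policy-iteration sketch over an MDP in the dictionary form from slide 9 (the mdp table and the discount factor are illustrative assumptions, not values from the talk):

```python
import random

def policy_iteration(mdp, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and policy improvement until stable.
    `mdp[state][action]` is a list of (probability, next_state, reward)."""
    states = list(mdp)
    policy = {s: random.choice(list(mdp[s])) for s in states}  # arbitrary start
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: make V consistent with the current policy.
        while True:
            delta = 0.0
            for s in states:
                v = sum(p * (r + gamma * V[s2])
                        for p, s2, r in mdp[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: make the policy greedy w.r.t. the current V.
        stable = True
        for s in states:
            best = max(mdp[s], key=lambda a: sum(
                p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```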
14. Reinforcement Learning Methods
[Taxonomy chart. Axes: model-based (transition model / sample model) vs. model-free; on-policy vs. off-policy; value-based vs. policy-based; Monte Carlo vs. temporal difference. Methods placed so far:
● Dynamic Programming: model-based, transition model
● Sarsa: model-free, value-based, on-policy, temporal difference
● Q-Learning: model-free, value-based, off-policy, temporal difference]
15. When learning happens
Monte Carlo: wait until the end of the episode before making updates to value estimates.
[Figure: a tic-tac-toe game played to completion; only once X wins are the values of all states visited during the episode updated.]
Temporal difference, TD(0): update every step using estimates of the next states (bootstrapping).
[Figure: the same game; after every move, the value of the previous state is updated.]
In this example, learning = updating the value of states.
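The slide leaves the two update rules implicit; in standard notation (α is the learning rate) they are:

```latex
% Monte Carlo: update toward the full observed return G_t.
V(S_t) \leftarrow V(S_t) + \alpha \,[\, G_t - V(S_t) \,]
% TD(0): update toward the bootstrapped one-step target.
V(S_t) \leftarrow V(S_t) + \alpha \,[\, R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \,]
```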
17. Sarsa
Initialize Q(s,a).
For each episode:
• Start in a random state, S.
• Choose action A from S using an 𝛆-greedy policy from Q(s,a).
• While S is not terminal:
1. Take action A, observe reward R and new state S'.
2. Choose action A' from S' using an 𝛆-greedy policy from Q(s,a).
3. Update Q for state S and action A:
   Q(S,A) ← Q(S,A) + α[R + γ·Q(S',A') - Q(S,A)]
4. S ← S', A ← A'

[Example: the Q-table starts at 0 for all five state-action pairs (sleepy/nap, energetic/beg, energetic/be good, hungry/beg, hungry/be good). Robodog is Energetic, takes A = be good, observes R = +5 and S' = Hungry, and chooses A' = beg; the update gives Q(energetic, be good) = 0.5, consistent with α = 0.1.]
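A tabular Sarsa sketch in Python. The environment interface (env.reset(), env.step(), env.actions()) is a hypothetical Gym-like stand-in, not something from the talk:

```python
import random
from collections import defaultdict

def sarsa(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Sarsa. Assumed env interface: reset() -> state,
    step(action) -> (next_state, reward, done), actions(state) -> list."""
    Q = defaultdict(float)  # Q[(state, action)], initialized to 0

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions(s))
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2) if not done else None
            target = r + (gamma * Q[(s2, a2)] if not done else 0.0)
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # Sarsa update
            s, a = s2, a2
    return Q
```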
18. Q-Learning
Initialize Q(s,a).
For each episode:
• Start in a random state, S.
• While S is not terminal:
1. Choose action A from S using an 𝛆-greedy policy from Q(s,a).
2. Take action A, observe reward R and new state S'.
3. Update Q for state S and action A:
   Q(S,A) ← Q(S,A) + α[R + γ·max_a Q(S',a) - Q(S,A)]
4. S ← S'

[Example: the same zeroed Q-table. Robodog is Energetic, takes A = be good, observes R = +5 and S' = Hungry; of the next-state values shown, Q(hungry, be good) = -1 and Q(hungry, beg) = 2, the maximum (2) is used in the update. The slide shows a resulting value of 0.5.]
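Q-Learning differs from the Sarsa sketch above in a single line: the target bootstraps from the greedy max over next actions rather than from the action actually taken next, which is what makes it off-policy (same hypothetical env interface):

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions(s))
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)          # behavior policy: epsilon-greedy
            s2, r, done = env.step(a)
            # Target policy: greedy max over next actions -> off-policy.
            best_next = (max(Q[(s2, a2)] for a2 in env.actions(s2))
                         if not done else 0.0)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```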
20. Reinforcement Learning Methods
[The taxonomy chart from slide 14, now adding:
● Dyna-Q: model-based, sample model]
21. Dyna-Q
Initialize Q(s,a) and Model(s,a).
For each episode:
• Start in a random state, S.
• While S is not terminal:
1. Choose action A from S using an 𝛆-greedy policy from Q(s,a).
2. Take action A, observe reward R and new state S'.
3. Update Q for state S and action A (ordinary Q-Learning update).
4. Update Model for state S and action A: Model(S,A) ← (R, S').
5. “Hallucinate” n transitions from the Model and apply the same Q-Learning update to each.

[Example: the table now stores R and S' alongside Q(S,A) for each pair, all initialized to 0 / NA. After Robodog takes be good from Energetic and observes +5 and Hungry, Model(energetic, be good) = (+5, hungry); after begging from Hungry and observing -6 and Sleepy, Model(hungry, beg) = (-6, sleepy). Repeatedly replaying the stored (+5, hungry) transition drives Q(energetic, be good) through 0.5, 0.95, 1.355, 1.7195, ..., consistent with α = 0.1.]
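A compact Dyna-Q sketch with a deterministic last-observation model and n hallucinated planning updates per real step (same hypothetical env interface as before):

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=500, n=10, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    model = {}  # model[(s, a)] = (r, s2, done): last observed outcome

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions(s))
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    def q_update(s, a, r, s2, done):
        best_next = (0.0 if done
                     else max(Q[(s2, a2)] for a2 in env.actions(s2)))
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)
            s2, r, done = env.step(a)
            q_update(s, a, r, s2, done)       # direct RL (ordinary Q-Learning)
            model[(s, a)] = (r, s2, done)     # model learning
            for _ in range(n):                # planning: hallucinated updates
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                q_update(ps, pa, pr, ps2, pdone)
            s = s2
    return Q
```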
23. Deep Reinforcement Learning
[Figure: a grid world with states 1 through 8. A tabular method needs one row for every state-action pair, Q(1,X), Q(1,Y), Q(1,Z), ..., Q(8,X), Q(8,Y), Q(8,Z). A neural network (“black box”) replaces the table: it takes state s as input and outputs Q(s,a) for each action a, i.e. Q(s,X), Q(s,Y), Q(s,Z).]
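A minimal PyTorch sketch of that black box; the layer sizes and dimensions are illustrative, not from the talk:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action,
    replacing the giant Q(s, a) lookup table."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, n_actions)

# Usage: Q-values for all actions from a single forward pass.
q_net = QNetwork(state_dim=9, n_actions=3)   # e.g. 3x3 board, actions X/Y/Z
q_values = q_net(torch.zeros(1, 9))          # tensor of shape (1, 3)
greedy_action = q_values.argmax(dim=1)
```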
24. Reinforcement Learning Methods
[The taxonomy chart again, now adding:
● Monte Carlo Tree Search: model-based, sample model, Monte Carlo
● Deep Q Networks*: model-free, value-based, off-policy, temporal difference
* Utilize deep learning]
25. Deep Q Networks (DQN)
[Figure: a neural network (“black box”) takes a tic-tac-toe board state s and outputs Q(s,a) for each action a: how good is it to take this action from this state?]
1. Initialize network.
2. Take one action under the Q policy.
3. Add the new information to the training data:
   [Table of transitions: (s1, a1, r1, s2), (s2, a2, r2, s3), ..., (st, at, rt, st+1)]
4. Use stochastic gradient descent to update the weights based on the error between the target ŷ = R + γ·max_a Q(S',a) and the prediction y = Q(S,A).
Repeat steps 2 - 4 until convergence.
26. Deep Q Networks (DQN)
1. Initialize network.
2. Take one action under the Q policy.
3. Add the new information to the training data.
4. Use stochastic gradient descent to update the weights.

Problem:
● Data are not i.i.d.
● Data are collected under an evolving policy, not the optimal policy that we are trying to learn.
Solution:
Create a replay buffer of size k and take small random samples from it.
[Table: instead of keeping every transition from (s1, a1, r1, s2) through (st, at, rt, st+1), the buffer keeps only the most recent k, from (st-k, at-k, rt-k, st-k+1) to (st, at, rt, st+1).]

Problem:
Instability is introduced when updating Q(s, a) using Q(s', a') from the same network.
Solution:
Use a secondary target network to evaluate Q(s', a'), and sync it with the primary network only after every n training iterations.
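Putting the two fixes together, a compressed training-loop sketch (PyTorch; the QNetwork from slide 23 and the Gym-like env interface are the same illustrative assumptions as before):

```python
import copy
import random
from collections import deque

import torch
import torch.nn.functional as F

def train_dqn(env, q_net, episodes=500, k=10_000, batch=32, gamma=0.99,
              epsilon=0.1, sync_every=1000, lr=1e-3):
    target_net = copy.deepcopy(q_net)          # secondary target network
    buffer = deque(maxlen=k)                   # replay buffer of size k
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    step = 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action under the current (primary) Q network
            if random.random() < epsilon:
                a = random.randrange(env.n_actions)
            else:
                with torch.no_grad():
                    a = q_net(torch.as_tensor(s).float().unsqueeze(0)).argmax().item()
            s2, r, done = env.step(a)
            buffer.append((s, a, r, s2, float(done)))
            s = s2
            if len(buffer) < batch:
                continue
            # sample a small, roughly i.i.d. minibatch from the buffer
            S, A, R, S2, D = zip(*random.sample(buffer, batch))
            S, S2 = torch.as_tensor(S).float(), torch.as_tensor(S2).float()
            A = torch.as_tensor(A).long().unsqueeze(1)
            q_pred = q_net(S).gather(1, A).squeeze(1)   # Q(s, a): primary net
            with torch.no_grad():                       # Q(s', a'): target net
                q_next = target_net(S2).max(dim=1).values
                target = (torch.as_tensor(R).float()
                          + gamma * q_next * (1 - torch.as_tensor(D).float()))
            loss = F.mse_loss(q_pred, target)
            opt.zero_grad(); loss.backward(); opt.step()
            step += 1
            if step % sync_every == 0:                  # periodic sync
                target_net.load_state_dict(q_net.state_dict())
    return q_net
```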
27. Reinforcement Learning Methods
[The taxonomy chart again, now adding:
● REINFORCE*: model-free, policy-based, on-policy, Monte Carlo
* Utilize deep learning]
28. REINFORCE
[Figure: a neural network (“black box”) takes a tic-tac-toe board state s and outputs 𝛑(a|s) for each action a: what is the probability of taking this action under policy 𝛑?]
1. Initialize network.
2. Play out a full episode under 𝛑, collecting rewards r1, r2, ..., rT.
3. For every step t, calculate the return from that state until the end of the episode:
   G_t = Σ_k γ^k·r_(t+k+1)
4. Use stochastic gradient descent to update the weights based on the loss -G_t·log 𝛑(a_t|s_t).
Repeat steps 2 - 4 until convergence.
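A minimal REINFORCE sketch (PyTorch; the policy_net mapping states to action logits and the env interface are illustrative assumptions):

```python
import torch

def reinforce(env, policy_net, episodes=1000, gamma=0.99, lr=1e-3):
    """Monte Carlo policy gradient. `policy_net` maps a state tensor to
    action logits; env uses the same hypothetical interface as before."""
    opt = torch.optim.Adam(policy_net.parameters(), lr=lr)
    for _ in range(episodes):
        log_probs, rewards = [], []
        s, done = env.reset(), False
        while not done:                 # 2. play out a full episode under pi
            logits = policy_net(torch.as_tensor(s).float().unsqueeze(0))
            dist = torch.distributions.Categorical(logits=logits)
            a = dist.sample()
            log_probs.append(dist.log_prob(a))
            s, r, done = env.step(a.item())
            rewards.append(r)
        # 3. return from each step until the end of the episode
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.as_tensor(returns)
        # 4. gradient step on loss = -sum_t G_t * log pi(a_t | s_t)
        loss = -(torch.cat(log_probs) * returns).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return policy_net
```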
29. DQN vs REINFORCE
                 DQN                       REINFORCE
Learning         Off-policy                On-policy
Updates          Temporal difference       Monte Carlo
Output           Q(s,a) ➝ Value-based      𝛑(a|s) ➝ Policy-based
Action spaces    Small, discrete only      Large discrete or continuous
Exploration      𝛆-greedy                  Built-in due to stochastic policy
Convergence      Slower to converge        Faster to converge
Experience       Less experience needed    More experience needed
30. Reinforcement Learning Methods
[The taxonomy chart again, now adding:
● Advantage Actor-Critic*: model-free, policy-based actor with a value-based critic, temporal difference
* Utilize deep learning]
32. Quick review
Progression of capabilities:
● Q-Learning → DQN: ability to generalize values in state space
● DQN → REINFORCE: ability to control in continuous action spaces using a stochastic policy
● REINFORCE → Q Actor-Critic: one-step updates
● Q Actor-Critic → A2C: reduced variability in gradients
34. Advantage Actor-Critic (A2C)
[Figure: a network with common layers takes state s, then splits into a policy net that outputs 𝛑(a|s) for each action and a value net that outputs V(s).]
Actor
Policy-based, like REINFORCE. Can now use temporal-difference learning and a baseline: the advantage A(s,a) = r + γ·V(s') - V(s) replaces the full Monte Carlo return.
Critic
Value-based; now learns the value of states, V(s), instead of state-action pairs.
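A sketch of the shared-trunk network and a single A2C update on a batch of transitions (PyTorch; the names, sizes, and tensor-input convention are all illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    """Common layers feeding a policy head (pi(a|s)) and a value head (V(s))."""
    def __init__(self, state_dim=9, n_actions=3, hidden=64):
        super().__init__()
        self.common = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: action logits
        self.value_head = nn.Linear(hidden, 1)           # critic: V(s)

    def forward(self, s):
        h = self.common(s)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def a2c_step(net, opt, s, a, r, s2, done, gamma=0.99):
    """One temporal-difference actor-critic update; s, a, r, s2, done
    are assumed to be batched tensors."""
    logits, v = net(s)
    with torch.no_grad():
        _, v_next = net(s2)
        target = r + gamma * v_next * (1.0 - done)   # one-step TD target
    advantage = (target - v).detach()                # low-variance baseline signal
    dist = torch.distributions.Categorical(logits=logits)
    actor_loss = -(dist.log_prob(a) * advantage).mean()  # push pi toward good actions
    critic_loss = F.mse_loss(v, target)                  # fit V(s) to the TD target
    loss = actor_loss + critic_loss
    opt.zero_grad(); loss.backward(); opt.step()
```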
36. Current state of reinforcement learning
Mostly in academia or research-focused companies, e.g. DeepMind, OpenAI.
● The most impressive progress has been made in games.
“The rule-of-thumb is that except in rare cases, domain-specific algorithms work faster and better than reinforcement learning.”¹
1 Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.
Driverless cars, robotics, etc. still largely do not use RL.
Barriers to entry:
● Too much real-world experience required
“Reinforcement learning is a type of machine learning whose hunger for data is even greater than supervised learning. It is really difficult to get enough data for reinforcement learning algorithms. There’s more work to be done to translate this to businesses and practice.”
- Andrew Ng
● Simulation is often not realistic enough
● Poor convergence properties
● There has not been enough development in transfer learning for RL models
○ Models do not generalize well outside of what they were trained on
37. Promising applications of RL (that aren’t games)
● Energy
● Finance
● Healthcare
● Some aspects of robotics
● NLP
● Computer systems
● Traffic light control
● Assisting GANs
● Neural network architecture
● Computer vision
● Education
● Recommendation systems
● Science & Math
38. References
Clark, Jack. “Faulty Reward Functions in the Wild.” OpenAI, 21 Dec. 2016.
Fridman, Lex (2015). MIT: Introduction to Deep Reinforcement Learning. https://www.youtube.com/watch?v=zR11FLZ-O9M
Fullstack Academy (2017). Monte Carlo Tree Search Tutorial. https://www.youtube.com/watch?v=Fbs4lnGLS8M
Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.
Lapan, M. (2018). Deep Reinforcement Learning Hands-On: Apply modern RL methods, with deep Q-networks, value iteration, policy gradients, TRPO, AlphaGo Zero and more. Birmingham, UK: Packt Publishing.
Silver, David (2015). University College London Reinforcement Learning Course. Lecture 7: Policy Gradient Methods.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press.
Towards Data Science. “Applications of Reinforcement Learning in Real World.” 1 Aug. 2018.
Editor's Notes
Environment → states of the environment
Discount factor prevents infinite return
Value vs policy based methods
drama/sci-fi
State-action pair
Dynamics
Sample model - by learning and planning we are often able to do better than we would with just learning alone
Only applies to single-agent, fully observable MDPs
Reward can propagate backwards
Almost all reinforcement learning methods are well described as generalized policy iteration
Monte Carlo = low bias, high variance
Temporal difference methods = higher bias, lower variance (and don’t need complete episodes in order to learn)
Lower variance is often better!
Major consideration in all RL algorithms
Greedy action = action that we currently believe has the most value
Decrease epsilon over time
Now, we no longer know the dynamics of Robodog
What is the advantage/disadvantage of off-policy vs on-policy?
Q-Learning and Sarsa were developed in the late 80s. While not state-of-the-art, as they only work for small state and action spaces, they laid the foundation for some of the modern reinforcement learning methods covered in Part 2
***Learning from experience can be expensive***
It is not necessarily the best scheme to use only the last observed reward and new state for our model
Q updates from the sample model get more interesting once more state-action pairs have been observed
While tabular methods would be memory intensive for large state spaces, the bigger issue is the time it would take to visit all states and observe and update their values - we need the ability to generalize
With deep RL, we can have some idea of the value of a state even if we’ve never seen it before
Developed by DeepMind in 2013 (the Nature version followed in 2015)
Stochastic gradient descent needs i.i.d. data
A lot of work has been done since 2015 to make these networks even better and more efficient
G is an unbiased estimate of the true Q
Loss function drives policy towards actions with positive reward and away from actions with negative reward
Major issues: noisy gradients (due to randomness of samples), high variance ---> unstable learning and possibly suboptimal policy
For each learning step, we update the policy net towards actions that the critic says are good, and update the value net to match the change in the actor’s policy
---> policy iteration
We can swap out A for Q in our loss function without changing the direction of the gradients, but while reducing variance greatly
A2C was introduced by OpenAI; the asynchronous variant (A3C) was developed by DeepMind
DeepMind has supposedly reduced Google’s energy consumption by 50%
NLP: Salesforce used RL among other text-generation models to write high-quality summaries of long text.
JPMorgan is using an RL agent to execute trades at opportune times
Healthcare - optimization of treatment for patients with chronic disease, deciphering medical images
Improving output of GANs by making output adhere to standard rules