A review of the basic ideas and concepts in reinforcement learning, including the Q-Learning and Sarsa methods, followed by a survey of modern RL methods, including Dyna-Q, DQN, REINFORCE, and A2C, and how they relate.
2. What to expect from this talk
Part 1: Introduce the foundations of reinforcement learning
● Definitions and basic ideas
● A couple of algorithms that work in simple environments
Part 2: Review some state-of-the-art methods
● Higher-level concepts, vanilla methods
● Not a complete list of cutting-edge methods
Part 3: Current state of reinforcement learning
4. What is reinforcement learning?
A type of machine learning where an agent interacts with an environment and learns to take actions that result in greater cumulative reward.
Unsupervised Learning: X alone is analyzed for patterns
● PCA
● Cluster analysis
● Outlier detection
Supervised Learning: X is used to predict Y
● Classification
● Regression
Reinforcement Learning: an agent learns from interaction with an environment
5. Definitions
Agent
The learner and decision maker.
Environment
Everything external to the agent that is used to make decisions.
Actions
The set of possible steps the agent can take, depending on the state of the environment.
Reward
Motivation for the agent. It is not always obvious what the reward signal should be, e.g. YOU WIN! +1, GAME OVER -1, stay alive +1/second (sort of).
6. The Problem with Rewards...
Designing reward functions is notoriously difficult (Clark, Jack. “Faulty Reward Functions in the Wild.” OpenAI, 21 Dec. 2016).
[Figure: a racing game played by a human versus a reinforcement learning agent. Possible reward structures: total points, time to finish, finishing position.]
“I’ve taken to imagining deep RL as a demon that’s deliberately misinterpreting your reward and actively searching for the laziest possible local optima.”¹
- Alex Irpan
1 Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.
7. More Definitions
Return
Long-term, discounted reward.
Value
Expected return.
value of states → V(s): how good is it to be in state s
value of state-action pairs → Q(s,a): how good is it to take action a from state s
Discount factor
γ, which weights future rewards relative to immediate ones (and prevents infinite returns).
Policy
How the agent should act from a given state → π(a|s)
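The formulas behind these definitions were images on the slide and did not survive extraction; in standard Sutton & Barto notation they are:

```latex
% Return: discounted sum of future rewards (0 <= gamma <= 1).
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
% Values are expected returns:
V(s)   = \mathbb{E}[\, G_t \mid S_t = s \,]
Q(s,a) = \mathbb{E}[\, G_t \mid S_t = s, A_t = a \,]
```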
8. Markov Decision Process
Markov Process
A random process whose future behavior only depends on the current state.
[Diagram: Robodog's three states (Sleepy, Energetic, Hungry) with transition probabilities between them.]
9. Markov Decision Process
Markov Process + Actions + Reward = Markov Decision Process
[Diagram: the same three states (Sleepy, Energetic, Hungry), now with actions (nap, beg, be good), transition probabilities, and rewards ranging from -6 to +10 attached to each transition.]
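As a minimal sketch, the diagram's structure can be written as a transition table in Python. The probabilities and rewards below are illustrative placeholders, since the diagram's exact values do not survive extraction:

```python
# Minimal sketch of the Robodog MDP as a transition table.
# Probabilities and rewards are illustrative placeholders, not the
# exact values from the slide's diagram.
# mdp[state][action] = list of (probability, next_state, reward)
mdp = {
    "sleepy": {
        "nap": [(0.7, "energetic", +2), (0.3, "sleepy", -1)],
    },
    "energetic": {
        "beg":     [(0.5, "hungry", +10), (0.5, "energetic", -1)],
        "be good": [(0.6, "hungry", +5),  (0.4, "energetic", -1)],
    },
    "hungry": {
        "beg":     [(0.4, "sleepy", -6), (0.6, "hungry", -2)],
        "be good": [(0.6, "sleepy", +7), (0.4, "hungry", -2)],
    },
}

# Sanity check: outgoing probabilities for each (state, action) sum to 1.
for state, actions in mdp.items():
    for action, transitions in actions.items():
        assert abs(sum(p for p, _, _ in transitions) - 1.0) < 1e-9
```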
10. To model or not to model
Model-based methods
● Transition model (planning): we already know the dynamics of the environment; we simply need to plan our actions to optimize return.
● Sample model (planning and learning): we don't know the dynamics; we try to learn them by exploring the environment, then use them to plan our actions to optimize return.
Model-free methods (learning)
● We don't know or care about the dynamics; we just want to learn a good policy by exploring the environment.
12. Bellman Equations
Value of each state under the optimal policy for Robodog:
Bellman Equation (the slide annotates its parts: policy, transition probabilities, reward, discount factor, value of the current state, value of the next state):
V_π(s) = Σ_a π(a|s) Σ_s′,r p(s′,r|s,a) [ r + γ·V_π(s′) ]
Bellman Optimality Equation:
V_*(s) = max_a Σ_s′,r p(s′,r|s,a) [ r + γ·V_*(s′) ]
13. Policy Iteration
Policy evaluation
Makes the value function consistent with the current policy.
Policy improvement
Makes the policy greedy with respect to the current value function.
[Diagram: Robodog's policy before and after one improvement step. The initial policy always naps when sleepy and splits 50%/50% between beg and be good when energetic or hungry; the improved policy is deterministic (100%/0%) in every state.]

state      value (initial policy)
sleepy     19.88
energetic  20.97
hungry     20.63

state      value (improved policy)
sleepy     29.66
energetic  31.66
hungry     31.90

Alternating the two steps converges to the optimal policy and the value function under the optimal policy.
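A minimal policy-iteration sketch over an MDP in the dictionary form from slide 9 (the mdp table and the discount factor are illustrative assumptions, not values from the talk):

```python
import random

def policy_iteration(mdp, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and policy improvement until stable.
    `mdp[state][action]` is a list of (probability, next_state, reward)."""
    states = list(mdp)
    policy = {s: random.choice(list(mdp[s])) for s in states}  # arbitrary start
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: make V consistent with the current policy.
        while True:
            delta = 0.0
            for s in states:
                v = sum(p * (r + gamma * V[s2])
                        for p, s2, r in mdp[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: make the policy greedy w.r.t. the current V.
        stable = True
        for s in states:
            best = max(mdp[s], key=lambda a: sum(
                p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```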
14. Reinforcement Learning Methods
[Taxonomy chart. Axes: model-based (transition model / sample model) vs. model-free; on-policy vs. off-policy; value-based vs. policy-based; Monte Carlo vs. temporal difference. Methods placed so far:
● Dynamic Programming: model-based, transition model
● Sarsa: model-free, value-based, on-policy, temporal difference
● Q-Learning: model-free, value-based, off-policy, temporal difference]
15. When learning happens
Monte Carlo: wait until the end of the episode before making updates to value estimates.
[Figure: a tic-tac-toe game played to completion; only once X wins are the values of all states visited during the episode updated.]
Temporal difference, TD(0): update every step using estimates of the next states (bootstrapping).
[Figure: the same game; after every move, the value of the previous state is updated.]
In this example, learning = updating the value of states.
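The slide leaves the two update rules implicit; in standard notation (α is the learning rate) they are:

```latex
% Monte Carlo: update toward the full observed return G_t.
V(S_t) \leftarrow V(S_t) + \alpha \,[\, G_t - V(S_t) \,]
% TD(0): update toward the bootstrapped one-step target.
V(S_t) \leftarrow V(S_t) + \alpha \,[\, R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \,]
```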
17. Sarsa
Initialize Q(s,a).
For each episode:
• Start in a random state, S.
• Choose action A from S using an 𝛆-greedy policy from Q(s,a).
• While S is not terminal:
1. Take action A, observe reward R and new state S'.
2. Choose action A' from S' using an 𝛆-greedy policy from Q(s,a).
3. Update Q for state S and action A:
   Q(S,A) ← Q(S,A) + α[R + γ·Q(S',A') - Q(S,A)]
4. S ← S', A ← A'

[Example: the Q-table starts at 0 for all five state-action pairs (sleepy/nap, energetic/beg, energetic/be good, hungry/beg, hungry/be good). Robodog is Energetic, takes A = be good, observes R = +5 and S' = Hungry, and chooses A' = beg; the update gives Q(energetic, be good) = 0.5, consistent with α = 0.1.]
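A tabular Sarsa sketch in Python. The environment interface (env.reset(), env.step(), env.actions()) is a hypothetical Gym-like stand-in, not something from the talk:

```python
import random
from collections import defaultdict

def sarsa(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Sarsa. Assumed env interface: reset() -> state,
    step(action) -> (next_state, reward, done), actions(state) -> list."""
    Q = defaultdict(float)  # Q[(state, action)], initialized to 0

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions(s))
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2) if not done else None
            target = r + (gamma * Q[(s2, a2)] if not done else 0.0)
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # Sarsa update
            s, a = s2, a2
    return Q
```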
18. Q-Learning
Initialize Q(s,a).
For each episode:
• Start in a random state, S.
• While S is not terminal:
1. Choose action A from S using an 𝛆-greedy policy from Q(s,a).
2. Take action A, observe reward R and new state S'.
3. Update Q for state S and action A:
   Q(S,A) ← Q(S,A) + α[R + γ·max_a Q(S',a) - Q(S,A)]
4. S ← S'

[Example: the same zeroed Q-table. Robodog is Energetic, takes A = be good, observes R = +5 and S' = Hungry; of the next-state values shown, Q(hungry, be good) = -1 and Q(hungry, beg) = 2, the maximum (2) is used in the update. The slide shows a resulting value of 0.5.]
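Q-Learning differs from the Sarsa sketch above in a single line: the target bootstraps from the greedy max over next actions rather than from the action actually taken next, which is what makes it off-policy (same hypothetical env interface):

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions(s))
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)          # behavior policy: epsilon-greedy
            s2, r, done = env.step(a)
            # Target policy: greedy max over next actions -> off-policy.
            best_next = (max(Q[(s2, a2)] for a2 in env.actions(s2))
                         if not done else 0.0)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```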
20. Reinforcement Learning Methods
[The taxonomy chart from slide 14, now adding:
● Dyna-Q: model-based, sample model]
21. Dyna-Q
Initialize Q(s,a) and Model(s,a).
For each episode:
• Start in a random state, S.
• While S is not terminal:
1. Choose action A from S using an 𝛆-greedy policy from Q(s,a).
2. Take action A, observe reward R and new state S'.
3. Update Q for state S and action A (ordinary Q-Learning update).
4. Update Model for state S and action A: Model(S,A) ← (R, S').
5. “Hallucinate” n transitions from the Model and apply the same Q-Learning update to each.

[Example: the table now stores R and S' alongside Q(S,A) for each pair, all initialized to 0 / NA. After Robodog takes be good from Energetic and observes +5 and Hungry, Model(energetic, be good) = (+5, hungry); after begging from Hungry and observing -6 and Sleepy, Model(hungry, beg) = (-6, sleepy). Repeatedly replaying the stored (+5, hungry) transition drives Q(energetic, be good) through 0.5, 0.95, 1.355, 1.7195, ..., consistent with α = 0.1.]
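A compact Dyna-Q sketch with a deterministic last-observation model and n hallucinated planning updates per real step (same hypothetical env interface as before):

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=500, n=10, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    model = {}  # model[(s, a)] = (r, s2, done): last observed outcome

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions(s))
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    def q_update(s, a, r, s2, done):
        best_next = (0.0 if done
                     else max(Q[(s2, a2)] for a2 in env.actions(s2)))
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)
            s2, r, done = env.step(a)
            q_update(s, a, r, s2, done)       # direct RL (ordinary Q-Learning)
            model[(s, a)] = (r, s2, done)     # model learning
            for _ in range(n):                # planning: hallucinated updates
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                q_update(ps, pa, pr, ps2, pdone)
            s = s2
    return Q
```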
23. Deep Reinforcement Learning
[Figure: a grid world with states 1 through 8. A tabular method needs one row for every state-action pair, Q(1,X), Q(1,Y), Q(1,Z), ..., Q(8,X), Q(8,Y), Q(8,Z). A neural network (“black box”) replaces the table: it takes state s as input and outputs Q(s,a) for each action a, i.e. Q(s,X), Q(s,Y), Q(s,Z).]
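A minimal PyTorch sketch of that black box; the layer sizes and dimensions are illustrative, not from the talk:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action,
    replacing the giant Q(s, a) lookup table."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, n_actions)

# Usage: Q-values for all actions from a single forward pass.
q_net = QNetwork(state_dim=9, n_actions=3)   # e.g. 3x3 board, actions X/Y/Z
q_values = q_net(torch.zeros(1, 9))          # tensor of shape (1, 3)
greedy_action = q_values.argmax(dim=1)
```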
24. Reinforcement Learning Methods
[The taxonomy chart again, now adding:
● Monte Carlo Tree Search: model-based, sample model, Monte Carlo
● Deep Q Networks*: model-free, value-based, off-policy, temporal difference
* Utilize deep learning]
25. Deep Q Networks (DQN)
[Figure: a neural network (“black box”) takes a tic-tac-toe board state s and outputs Q(s,a) for each action a: how good is it to take this action from this state?]
1. Initialize network.
2. Take one action under the Q policy.
3. Add the new information to the training data:
   [Table of transitions: (s1, a1, r1, s2), (s2, a2, r2, s3), ..., (st, at, rt, st+1)]
4. Use stochastic gradient descent to update the weights based on the error between the target ŷ = R + γ·max_a Q(S',a) and the prediction y = Q(S,A).
Repeat steps 2 - 4 until convergence.
26. Deep Q Networks (DQN)
1. Initialize network.
2. Take one action under the Q policy.
3. Add the new information to the training data.
4. Use stochastic gradient descent to update the weights.

Problem:
● Data are not i.i.d.
● Data are collected under an evolving policy, not the optimal policy that we are trying to learn.
Solution:
Create a replay buffer of size k and take small random samples from it.
[Table: instead of keeping every transition from (s1, a1, r1, s2) through (st, at, rt, st+1), the buffer keeps only the most recent k, from (st-k, at-k, rt-k, st-k+1) to (st, at, rt, st+1).]

Problem:
Instability is introduced when updating Q(s, a) using Q(s', a') from the same network.
Solution:
Use a secondary target network to evaluate Q(s', a'), and sync it with the primary network only after every n training iterations.
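Putting the two fixes together, a compressed training-loop sketch (PyTorch; the QNetwork from slide 23 and the Gym-like env interface are the same illustrative assumptions as before):

```python
import copy
import random
from collections import deque

import torch
import torch.nn.functional as F

def train_dqn(env, q_net, episodes=500, k=10_000, batch=32, gamma=0.99,
              epsilon=0.1, sync_every=1000, lr=1e-3):
    target_net = copy.deepcopy(q_net)          # secondary target network
    buffer = deque(maxlen=k)                   # replay buffer of size k
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    step = 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action under the current (primary) Q network
            if random.random() < epsilon:
                a = random.randrange(env.n_actions)
            else:
                with torch.no_grad():
                    a = q_net(torch.as_tensor(s).float().unsqueeze(0)).argmax().item()
            s2, r, done = env.step(a)
            buffer.append((s, a, r, s2, float(done)))
            s = s2
            if len(buffer) < batch:
                continue
            # sample a small, roughly i.i.d. minibatch from the buffer
            S, A, R, S2, D = zip(*random.sample(buffer, batch))
            S, S2 = torch.as_tensor(S).float(), torch.as_tensor(S2).float()
            A = torch.as_tensor(A).long().unsqueeze(1)
            q_pred = q_net(S).gather(1, A).squeeze(1)   # Q(s, a): primary net
            with torch.no_grad():                       # Q(s', a'): target net
                q_next = target_net(S2).max(dim=1).values
                target = (torch.as_tensor(R).float()
                          + gamma * q_next * (1 - torch.as_tensor(D).float()))
            loss = F.mse_loss(q_pred, target)
            opt.zero_grad(); loss.backward(); opt.step()
            step += 1
            if step % sync_every == 0:                  # periodic sync
                target_net.load_state_dict(q_net.state_dict())
    return q_net
```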
27. Reinforcement Learning Methods
[The taxonomy chart again, now adding:
● REINFORCE*: model-free, policy-based, on-policy, Monte Carlo
* Utilize deep learning]
28. REINFORCE
[Figure: a neural network (“black box”) takes a tic-tac-toe board state s and outputs 𝛑(a|s) for each action a: what is the probability of taking this action under policy 𝛑?]
1. Initialize network.
2. Play out a full episode under 𝛑, collecting rewards r1, r2, ..., rT.
3. For every step t, calculate the return from that state until the end of the episode:
   G_t = Σ_k γ^k·r_(t+k+1)
4. Use stochastic gradient descent to update the weights based on the loss -G_t·log 𝛑(a_t|s_t).
Repeat steps 2 - 4 until convergence.
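A minimal REINFORCE sketch (PyTorch; the policy_net mapping states to action logits and the env interface are illustrative assumptions):

```python
import torch

def reinforce(env, policy_net, episodes=1000, gamma=0.99, lr=1e-3):
    """Monte Carlo policy gradient. `policy_net` maps a state tensor to
    action logits; env uses the same hypothetical interface as before."""
    opt = torch.optim.Adam(policy_net.parameters(), lr=lr)
    for _ in range(episodes):
        log_probs, rewards = [], []
        s, done = env.reset(), False
        while not done:                 # 2. play out a full episode under pi
            logits = policy_net(torch.as_tensor(s).float().unsqueeze(0))
            dist = torch.distributions.Categorical(logits=logits)
            a = dist.sample()
            log_probs.append(dist.log_prob(a))
            s, r, done = env.step(a.item())
            rewards.append(r)
        # 3. return from each step until the end of the episode
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.as_tensor(returns)
        # 4. gradient step on loss = -sum_t G_t * log pi(a_t | s_t)
        loss = -(torch.cat(log_probs) * returns).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return policy_net
```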
29. DQN vs REINFORCE
                 DQN                       REINFORCE
Learning         Off-policy                On-policy
Updates          Temporal difference       Monte Carlo
Output           Q(s,a) ➝ Value-based      𝛑(a|s) ➝ Policy-based
Action spaces    Small, discrete only      Large discrete or continuous
Exploration      𝛆-greedy                  Built-in due to stochastic policy
Convergence      Slower to converge        Faster to converge
Experience       Less experience needed    More experience needed
30. Reinforcement Learning Methods
[The taxonomy chart again, now adding:
● Advantage Actor-Critic*: model-free, policy-based actor with a value-based critic, temporal difference
* Utilize deep learning]
32. Quick review
Progression of capabilities:
● Q-Learning → DQN: ability to generalize values in state space
● DQN → REINFORCE: ability to control in continuous action spaces using a stochastic policy
● REINFORCE → Q Actor-Critic: one-step updates
● Q Actor-Critic → A2C: reduced variability in gradients
34. Advantage Actor-Critic (A2C)
[Figure: a network with common layers takes state s, then splits into a policy net that outputs 𝛑(a|s) for each action and a value net that outputs V(s).]
Actor
Policy-based, like REINFORCE. Can now use temporal-difference learning and a baseline: the advantage A(s,a) = r + γ·V(s') - V(s) replaces the full Monte Carlo return.
Critic
Value-based; now learns the value of states, V(s), instead of state-action pairs.
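A sketch of the shared-trunk network and a single A2C update on a batch of transitions (PyTorch; the names, sizes, and tensor-input convention are all illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    """Common layers feeding a policy head (pi(a|s)) and a value head (V(s))."""
    def __init__(self, state_dim=9, n_actions=3, hidden=64):
        super().__init__()
        self.common = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: action logits
        self.value_head = nn.Linear(hidden, 1)           # critic: V(s)

    def forward(self, s):
        h = self.common(s)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def a2c_step(net, opt, s, a, r, s2, done, gamma=0.99):
    """One temporal-difference actor-critic update; s, a, r, s2, done
    are assumed to be batched tensors."""
    logits, v = net(s)
    with torch.no_grad():
        _, v_next = net(s2)
        target = r + gamma * v_next * (1.0 - done)   # one-step TD target
    advantage = (target - v).detach()                # low-variance baseline signal
    dist = torch.distributions.Categorical(logits=logits)
    actor_loss = -(dist.log_prob(a) * advantage).mean()  # push pi toward good actions
    critic_loss = F.mse_loss(v, target)                  # fit V(s) to the TD target
    loss = actor_loss + critic_loss
    opt.zero_grad(); loss.backward(); opt.step()
```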
36. Current state of reinforcement learning
Mostly in academia or research-focused companies, e.g. DeepMind, OpenAI.
● The most impressive progress has been made in games.
“The rule-of-thumb is that except in rare cases, domain-specific algorithms work faster and better than reinforcement learning.”¹
1 Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.
Driverless cars, robotics, etc. still largely do not use RL.
Barriers to entry:
● Too much real-world experience required
“Reinforcement learning is a type of machine learning whose hunger for data is even greater than supervised learning. It is really difficult to get enough data for reinforcement learning algorithms. There’s more work to be done to translate this to businesses and practice.”
- Andrew Ng
● Simulation is often not realistic enough
● Poor convergence properties
● There has not been enough development in transfer learning for RL models
○ Models do not generalize well outside of what they were trained on
37. Promising applications of RL (that aren’t games)
● Energy
● Finance
● Healthcare
● Some aspects of robotics
● NLP
● Computer systems
● Traffic light control
● Assisting GANs
● Neural network architecture
● Computer vision
● Education
● Recommendation systems
● Science & Math
38. References
Clark, Jack. “Faulty Reward Functions in the Wild.” OpenAI, 21 Dec. 2016.
Fridman, Lex (2015). MIT: Introduction to Deep Reinforcement Learning. https://www.youtube.com/watch?v=zR11FLZ-O9M
Fullstack Academy (2017). Monte Carlo Tree Search Tutorial. https://www.youtube.com/watch?v=Fbs4lnGLS8M
Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.
Lapan, M. (2018). Deep Reinforcement Learning Hands-On: Apply modern RL methods, with deep Q-networks, value iteration, policy gradients, TRPO, AlphaGo Zero and more. Birmingham, UK: Packt Publishing.
Silver, David (2015). University College London Reinforcement Learning Course. Lecture 7: Policy Gradient Methods.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press.
Towards Data Science. “Applications of Reinforcement Learning in Real World.” 1 Aug. 2018.
Editor's Notes
Environment → states of the environment
Discount factor prevents infinite return
Value vs policy based methods
drama/sci-fi
State-action pair
Dynamics
Sample model - by learning and planning we are often able to do better than we would with just learning alone
Only applies to single-agent, fully observable MDPs
Reward can propagate backwards
Almost all reinforcement learning methods are well described as generalized policy iteration
Monte Carlo = low bias, high variance
Temporal difference methods = higher bias, lower variance (and don’t need complete episodes in order to learn)
Lower variance is often better!
Major consideration in all RL algorithms
Greedy action = action that we currently believe has the most value
Decrease epsilon over time
Now, we no longer know the dynamics of Robodog
What is the advantage/disadvantage of off-policy vs on-policy?
Q-Learning and Sarsa were developed in the late 80s. While not state-of-the-art, as they only work for small state and action spaces, they laid the foundation for some of the modern reinforcement learning methods covered in Part 2
***Learning from experience can be expensive***
It is not necessarily the best scheme to use only the last observed reward and new state for our model
Q updates from the sample model get more interesting once more state-action pairs have been observed
While tabular methods would be memory intensive for large state spaces, the bigger issue is the time it would take to visit all states and observe and update their values - we need the ability to generalize
With deep RL, we can have some idea of the value of a state even if we’ve never seen it before
Developed by DeepMind in 2013 (the Nature version followed in 2015)
Stochastic gradient descent needs i.i.d. data
A lot of work has been done since 2015 to make these networks even better and more efficient
G is an unbiased estimate of the true Q
Loss function drives policy towards actions with positive reward and away from actions with negative reward
Major issues: noisy gradients (due to randomness of samples), high variance ---> unstable learning and possibly suboptimal policy
For each learning step, we update the policy net towards actions that the critic says are good, and update the value net to match the change in the actor’s policy
---> policy iteration
We can swap out A for Q in our loss function without changing the direction of the gradients, but while reducing variance greatly
A2C was introduced by OpenAI; the asynchronous variant (A3C) was developed by DeepMind
DeepMind has supposedly reduced Google’s energy consumption by 50%
NLP: Salesforce used RL among other text-generation models to write high-quality summaries of long text.
JPMorgan is using an RL agent to execute trades at opportune times
Healthcare - optimization of treatment for patients with chronic disease, deciphering medical images
Improving output of GANs by making output adhere to standard rules