Survey of Modern
Reinforcement Learning
Julia Maddalena
What to expect from this talk
Part 1 Introduce the foundations of reinforcement learning
● Definitions and basic ideas
● A couple algorithms that work in simple environments
Part 2 Review some state-of-the-art methods
● Higher level concepts, vanilla methods
● Not a complete list of cutting edge methods
Part 3 Current state of reinforcement learning
Part 1
Foundations of reinforcement learning
What is reinforcement learning?
A type of machine learning where
an agent interacts with an
environment and learns to take
actions that result in greater
cumulative reward.
Unsupervised Learning: X alone is analyzed for patterns
● PCA
● Cluster analysis
● Outlier detection
Supervised Learning: X is used to predict Y
● Classification
● Regression
Reinforcement Learning: an agent learns from interaction, as defined above
Definitions
Reward
Motivation for the agent. Not always obvious what the reward
signal should be
YOU WIN! +1
GAME OVER -1
Stay alive
+1/second
(sort of)
Agent
The learner and decision maker
Environment
Everything external to the agent used to
make decisions
Actions
The set of possible steps the agent can take
depending on the state of the environment
The Problem with Rewards...
Designing reward functions is notoriously difficult
Clark, Jack. “Faulty Reward Functions in the Wild.” OpenAI, 21 Dec. 2016.
1 Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.
Possible reward structure
● Total points
● Time to finish
● Finishing position
Human player
“I’ve taken to imagining deep RL as a
demon that’s deliberately misinterpreting
your reward and actively searching for the
laziest possible local optima.”
- Alex Irpan
Reinforcement Learning Agent
More Definitions
Return
Long-term, discounted reward
Value
Expected return
value of states → V(s)
how good is it to be in state s
value of state-action pairs → Q(s,a)
how good is it to take action a from state s
discount factor
Policy
How the agent should act from a given state → π(a|s)
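The slide names the discount factor, but the return formula itself was a graphic; the standard form these definitions refer to is
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}
As an illustrative example (not from the talk): with γ = 0.9 and a reward of +1 on every step, the return is at most 1 / (1 − γ) = 10, which is why discounting keeps the return finite.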
Markov Decision Process
Markov Process
A random process whose future behavior only depends on the current state.
[Diagram: a Markov process over the states Sleepy, Energetic, and Hungry, with arrows labeled by transition probabilities]
Markov Decision Process
[Diagram: the same three states with actions (nap, beg, be good) added, and arrows labeled with transition probabilities and rewards]
Markov Process + Actions + Reward = Markov Decision Process
To model or not to model
Model-based methods
Transition model → Planning
● We already know the dynamics of the environment
● We simply need to plan our actions to optimize return
Sample model → Planning and learning
● We don't know the dynamics
● We try to learn them by exploring the environment and use them to plan our actions to optimize return
Model-free methods → Learning
● We don't know or care about the dynamics; we just want to learn a good policy by exploring the environment
Reinforcement Learning Methods
Model-based (transition model or sample model) vs. model-free
● Dynamic Programming: model-based, transition model
Bellman Equations
Value of each state under the optimal policy for Robodog:
Bellman Equation (relates the value of the current state to the policy, transition probabilities, reward, discount factor, and value of the next state):
V_π(s) = Σ_a π(a|s) Σ_s' P(s'|s,a) [ R(s,a,s') + γ V_π(s') ]
Bellman Optimality Equation:
V*(s) = max_a Σ_s' P(s'|s,a) [ R(s,a,s') + γ V*(s') ]
Policy Iteration
Policy evaluation
Makes the value function
consistent with the current policy
Policy improvement
Makes the policy greedy with respect to the current value function
[Diagram: an initial policy (nap from sleepy; 50/50 between beg and be good from energetic and hungry) evaluates to state values sleepy 19.88, energetic 20.97, hungry 20.63; after improvement, a deterministic policy evaluates to sleepy 29.66, energetic 31.66, hungry 31.90]
Converge to
optimal policy
and value under
optimal policy
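To make the evaluation/improvement loop concrete, here is a minimal policy-iteration sketch in Python. The three-state MDP below is hypothetical (made-up probabilities and rewards, not the actual Robodog numbers behind the values above).

# Hypothetical 3-state MDP: 0 = sleepy, 1 = energetic, 2 = hungry.
# P[s][a] is a list of (probability, next_state, reward) tuples -- illustrative numbers only.
P = {
    0: {"nap": [(1.0, 1, 2)]},
    1: {"be good": [(0.6, 2, 5), (0.4, 1, -1)],
        "beg": [(0.7, 2, -2), (0.3, 0, -1)]},
    2: {"be good": [(0.5, 0, 7), (0.5, 1, 10)],
        "beg": [(0.6, 0, -6), (0.4, 2, -4)]},
}
gamma = 0.9

def policy_evaluation(policy, V, tol=1e-6):
    """Policy evaluation: make V consistent with the current deterministic policy."""
    while True:
        delta = 0.0
        for s in P:
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def policy_improvement(V):
    """Policy improvement: make the policy greedy with respect to the current value function."""
    return {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            for s in P}

policy = {s: next(iter(P[s])) for s in P}   # arbitrary starting policy
V = {s: 0.0 for s in P}
while True:
    V = policy_evaluation(policy, V)
    improved = policy_improvement(V)
    if improved == policy:                  # evaluation and improvement have converged
        break
    policy = improved
print(policy, V)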
Reinforcement Learning Methods
Model-based (transition model or sample model) vs. model-free (on-policy or off-policy; value-based or policy-based); model-free methods update by Monte Carlo or temporal difference
● Dynamic Programming: model-based, transition model
● Sarsa: model-free, on-policy, value-based, temporal difference
● Q-Learning: model-free, off-policy, value-based, temporal difference
When learning happens
Monte Carlo: wait until end of episode before making updates to value estimates
[Figure: a tic-tac-toe episode played to the end; once the outcome is known, the values of all states visited in the episode are updated]
Temporal difference, TD(0): update every step using estimates of next states (bootstrapping)
[Figure: the same episode, but after every move the value of the previous state is updated]
in this example, learning = updating value of states
Exploration vs exploitation
𝜀-greedy policy: with probability 𝜀 take a random action (exploration); otherwise take the action currently believed best (exploitation)
(image: Exploration vs Exploitation, Will Evans, slideshare.net/willevans)
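A minimal sketch of 𝜀-greedy action selection over a Q table (a dict keyed by (state, action); the names are illustrative):

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit (current best action)."""
    if random.random() < epsilon:
        return random.choice(actions)                            # exploration
    return max(actions, key=lambda a: Q.get((state, a), 0.0))    # exploitation

Decreasing epsilon over time, as the speaker notes suggest, gradually shifts the agent from exploration toward exploitation.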
Sarsa
S A Q(S, A)
sleepy nap 0
energetic beg 0
energetic be good 0
hungry beg 0
hungry be good 0
[Example transition: S = Energetic, A = be good, R = +5, S' = Hungry, A' = beg]
Initialize Q(s,a)
For each episode:
• Start in a random state, S.
• Choose action A from S using 𝛆-greedy policy from Q(s,a).
• While S is not terminal:
1. Take action A, observe reward R and new state S’.
2. Choose action A’ from S’ using 𝛆-greedy policy from Q(s,a).
3. Update Q for state S and action A: Q(S,A) ← Q(S,A) + α [ R + γ Q(S',A') − Q(S,A) ]
4. S ← S’, A ← A’
(example update result: 0.5)
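A sketch of the Sarsa update in step 3, assuming a dict-based Q table and a learning rate α (the learning rate is not given on the slide):

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: the target bootstraps from the action A' actually chosen at S'."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

With α = 0.1 and all entries starting at 0, the example transition above (Energetic, be good, +5, Hungry, beg) moves Q(Energetic, be good) from 0 to 0.5.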
Q-Learning
S A Q(S, A)
sleepy nap 0
energetic beg 0
energetic be good 0
hungry beg 0
hungry be good 0
[Example transition: S = Energetic, A = be good, R = +5, S' = Hungry; at S', the current estimates for the two actions are −1 and 2, and the update bootstraps from the maximum, 2]
Initialize Q(s,a)
For each episode:
• Start in a random state, S.
• While S is not terminal:
1. Choose action A from S using 𝛆-greedy policy from Q(s,a).
2. Take action A, observe reward R and new state S'.
3. Update Q for state S and action A: Q(S,A) ← Q(S,A) + α [ R + γ max_a Q(S',a) − Q(S,A) ]
4. S ← S'
(example update result shown on the slide: 0.5)
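Q-Learning differs from Sarsa only in the target: it bootstraps from the best action at S' rather than the action actually taken next. A sketch under the same assumptions as the Sarsa snippet:

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD update: the target uses the max over actions at S'."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))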
Part 2
State-of-the-art methods
Reinforcement Learning Methods
Model-based (transition model or sample model) vs. model-free (on-policy or off-policy; value-based or policy-based)
● Dynamic Programming: model-based, transition model
● Dyna-Q: model-based, sample model
● Sarsa: model-free, on-policy, value-based, temporal difference
● Q-Learning: model-free, off-policy, value-based, temporal difference
Dyna-Q
Initialize Q(s,a) and Model(s,a)
For each episode:
• Start in a random state, S.
• While S is not terminal:
1. Choose action A from S using 𝛆-greedy policy from Q(s,a).
2. Take action A, observe reward R and new state S'.
3. Update Q for state S and action A: Q(S,A) ← Q(S,A) + α [ R + γ max_a Q(S',a) − Q(S,A) ]
(steps 1-3 are ordinary Q-Learning)
4. Update Model for state S and action A: Model(S,A) ← (R, S')
5. "Hallucinate" n transitions from the Model and use them to update Q with the same rule.
S A Q(S, A) R S'
sleepy nap 0 0 NA
energetic beg 0 0 NA
energetic be good 0 0 NA
hungry beg 0 0 NA
hungry be good 0 0 NA
[Example: a real step S = Energetic, A = be good, R = +5, S' = Hungry updates both Q (to 0.5) and the Model (R = 5, S' = hungry); later real transitions such as (Hungry, beg) → Sleepy with R = −6 are also stored; replaying the stored transition grows Q(Energetic, be good) through 0.95, 1.355, 1.7195 over successive planning updates]
Dyna-Q
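A sketch of the Dyna-Q planning step: "hallucinating" n transitions from the learned model and feeding them through the ordinary Q-Learning update. The deterministic dict-based Model and all names are assumptions for illustration:

import random

def dyna_q_planning(Q, model, actions, n=10, alpha=0.1, gamma=0.9):
    """Replay n previously observed (state, action) pairs from the model with Q-Learning updates."""
    for _ in range(n):
        s, a = random.choice(list(model.keys()))   # a previously observed state-action pair
        r, s_next = model[(s, a)]                  # stored reward and next state for that pair
        best_next = max(Q.get((s_next, b), 0.0) for b in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))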
Deep Reinforcement Learning
[Figure: with a tabular approach, every combination of state (1-8) and action (X, Y, Z) needs its own Q(s,a) entry; with a "black box" function approximator, the state s goes in and Q(s, X), Q(s, Y), Q(s, Z) come out]
Reinforcement Learning Methods
Model-based (transition model or sample model) vs. model-free (on-policy or off-policy; value-based or policy-based)
● Dynamic Programming: model-based, transition model
● Dyna-Q, Monte Carlo Tree Search: model-based, sample model
● Sarsa: model-free, on-policy, value-based, temporal difference
● Q-Learning, Deep Q Networks*: model-free, off-policy, value-based, temporal difference
* Utilize deep learning
Deep Q Networks (DQN)
[Figure: a tic-tac-toe board (the state s) is fed into a neural-network "black box" that outputs Q(s, a) for each action a: how good is it to take this action from this state?]
1. Initialize network.
2. Take one action under the Q policy.
3. Add the new information to the training data:
s a r s'
1 s1 a1 r1 s2
2 s2 a2 r2 s3
... ... ... ...
t st at rt st+1
4. Use stochastic gradient descent to update the weights based on the error between the prediction ŷ = Q(s, a) and the target y = R + γ max_a' Q(s', a').
Repeat steps 2 - 4 until convergence
Deep Q Networks (DQN)
1. Initialize network.
2. Take one action under Q policy.
3. Add new information to training data:
4. Use stochastic gradient descent to update weights based on:
Problem:
● Data not i.i.d.
● Data collected based on an evolving policy, not the optimal
policy that we are trying to learn.
Solution:
Create a replay buffer of size k to take small samples from
Problem:
Instability introduced when updating Q(s, a) using Q(s’, a’)
Solution:
Have a secondary target network used to evaluate Q(s’, a’) and
only sync with primary network after every n training iterations
[Diagram: a primary network and a target network; the target network evaluates Q(s', a') and is synced with the primary every n iterations]
[Tables: the full experience history (transitions 1 through t) vs. a replay buffer that keeps only the most recent k transitions (t − k through t)]
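A sketch of the two fixes described on this slide: a bounded replay buffer and a periodically synced target network. Class and function names are illustrative, not from the talk:

import random
from collections import deque

class ReplayBuffer:
    """Keeps only the most recent k transitions and returns small random minibatches."""
    def __init__(self, k):
        self.buffer = deque(maxlen=k)
    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))
    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def maybe_sync_target(step, primary_weights, target_weights, n=1000):
    """Copy the primary network's weights into the target network every n training iterations."""
    if step % n == 0:
        target_weights.update(primary_weights)   # weights represented as plain dicts for illustration
    return target_weights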
Reinforcement Learning Methods
Model-based (transition model or sample model) vs. model-free (on-policy or off-policy; value-based or policy-based)
● Dynamic Programming: model-based, transition model
● Dyna-Q, Monte Carlo Tree Search: model-based, sample model
● Sarsa: model-free, on-policy, value-based, temporal difference
● Q-Learning, Deep Q Networks*: model-free, off-policy, value-based, temporal difference
● REINFORCE*: model-free, on-policy, policy-based, Monte Carlo
* Utilize deep learning
REINFORCE
[Figure: the board state s is fed into a neural-network "black box" that outputs 𝛑(a|s) for each action a: the probability of taking this action under policy 𝛑]
1. Initialize network.
2. Play out a full episode under 𝛑, collecting the rewards r1, r2, …, r8.
3. For every step t, calculate the return from that state until the end: Gt = rt+1 + γ rt+2 + γ² rt+3 + …
4. Use stochastic gradient descent to update the weights based on the policy-gradient loss −Gt log 𝛑(at | st).
Repeat steps 2 - 4 until convergence
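A sketch of step 3 (the discounted return from each step to the end of the episode) and the per-step policy-gradient loss used in step 4; it assumes the log-probabilities of the chosen actions are available from the network:

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma * r_{t+2} + ..., computed backwards over one full episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def reinforce_loss(log_probs, returns):
    """Minimizing -G_t * log pi(a_t|s_t) pushes the policy toward actions that led to high returns."""
    return -sum(lp * g for lp, g in zip(log_probs, returns))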
DQN vs REINFORCE
DQN REINFORCE
Learning Off-policy On-policy
Updates Temporal difference Monte Carlo
Output Q(s,a) ➝ Value-based 𝛑(a|s) ➝ Policy-based
Action spaces Small discrete only Large discrete or continuous
Exploration 𝛆-greedy Built-in due to stochastic policy
Convergence Slower to converge Faster to converge
Experience Less experience needed More experience needed
Reinforcement Learning Methods
Model-based (transition model or sample model) vs. model-free (on-policy or off-policy; value-based or policy-based)
● Dynamic Programming: model-based, transition model
● Dyna-Q, Monte Carlo Tree Search: model-based, sample model
● Sarsa: model-free, on-policy, value-based, temporal difference
● Q-Learning, Deep Q Networks*: model-free, off-policy, value-based, temporal difference
● REINFORCE*: model-free, on-policy, policy-based, Monte Carlo
● Advantage Actor-Critic*: model-free, on-policy, combines value-based and policy-based, temporal difference
* Utilize deep learning
Q Actor-Critic
[Figure: the state s passes through common layers and then splits into a policy net that outputs 𝛑(a|s) for each action a, and a value net that outputs Q(s,a) for each action a]
Actor
Policy-based like REINFORCE
but can now use temporal
difference learning
Critic
Value-based, works sort of like
DQN
Quick review
Q-Learning → DQN: gains the ability to generalize values across the state space
DQN → REINFORCE: gains the ability to act in continuous action spaces using a stochastic policy
REINFORCE → Q Actor-Critic: gains one-step (temporal-difference) updates
Q Actor-Critic → A2C: reduces variability in the gradients
Advantage vs action value
A(S, A) = Q(S, A) − V(S)
The advantage of an action is its action value minus the value of the state it is taken from.
Advantage Actor-Critic (A2C)
[Figure: the state s passes through common layers and then splits into a policy net that outputs 𝛑(a|s) for each action a, and a value net that outputs V(s)]
Actor
Policy-based like REINFORCE, but can now use temporal-difference learning with a baseline: the advantage A(s,a) = R + γ V(s') − V(s) replaces the raw return.
Critic
Value-based; now learns the value of states V(s) instead of state-action pairs.
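A sketch of the A2C loss terms using the TD advantage as a baseline. All names are illustrative, and the entropy bonus often added in practice is omitted:

def a2c_losses(log_prob, value, next_value, reward, gamma=0.99):
    """Advantage A = r + gamma*V(s') - V(s); the actor moves toward advantageous actions, the critic fits V(s)."""
    advantage = reward + gamma * next_value - value
    actor_loss = -log_prob * advantage          # policy-gradient term with the baseline subtracted
    critic_loss = advantage ** 2                # squared TD error for the value net
    return actor_loss, critic_loss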
Part 3
Current state of reinforcement learning
Current state of reinforcement learning
Mostly in academia or research-focused companies, e.g. DeepMind, OpenAI
● Most impressive progress has been made in games
1 Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.
Barriers to entry:
● Too much real-world experience required
Driverless cars, robotics, etc. are still largely not using RL.
“The rule-of-thumb is that except in rare cases, domain-specific algorithms work faster and
better than reinforcement learning.”1
“Reinforcement learning is a type of machine learning whose hunger
for data is even greater than supervised learning. It is really difficult to
get enough data for reinforcement learning algorithms. There’s more
work to be done to translate this to businesses and practice.”
- Andrew Ng
● Simulation is often not realistic enough
● Poor convergence properties
● There has not been enough development in transfer learning for RL models
○ Models do not generalize well outside of what they are trained on
Promising applications of RL (that aren’t games)
Energy Finance
Healthcare
Some aspects of
robotics
NLP
Computer
systems
Traffic light
control
Assisting GANs
Neural network
architecture
Computer vision
Education
Recommendation
systems
Science & Math
References
Clark, Jack. “Faulty Reward Functions in the Wild.” OpenAI, 21 Dec. 2016.
Fridman, Lex (2015). MIT: Introduction to Deep Reinforcement Learning. https://www.youtube.com/watch?v=zR11FLZ-O9M
Fullstack Academy (2017). Monte Carlo Tree Search Tutorial. https://www.youtube.com/watch?v=Fbs4lnGLS8M
Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.
Lapan, M. (2018). Deep Reinforcement Learning Hands-On: Apply Modern RL Methods, with Deep Q-Networks, Value Iteration, Policy Gradients, TRPO, AlphaGo Zero and More. Birmingham, UK: Packt Publishing.
Silver, David (2015). University College London Reinforcement Learning Course. Lecture 7: Policy Gradient Methods.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press.
Towards Data Science. “Applications of Reinforcement Learning in Real World.” 1 Aug. 2018.
Editor's Notes

  1. Environment → states of the environment
  2. ****Describe image!!**** Discount factor prevents infinite return Value vs policy based methods
  3. drama/sci-fi
  4. State-action pair Dynamics
  5. Sample model - by learning and planning we are often able to do better than we would with just learning alone
  6. Only applies to single agent fully observable MDPs
  7. Reward can propagate backwards
  8. Almost all reinforcement learning methods are well described as generalized policy iteration
  9. Monte Carlo = low bias, high variance Temporal difference methods = higher bias, lower variance (and don’t need complete episodes in order to learn) Lower variance is often better!
  10. Major consideration in all RL algorithms Greedy action = action that we currently believe has the most value Decrease epsilon over time
  11. Now, we no longer know the dynamics of Robodog
  12. What is the advantage/disadvantage of off-policy vs on-policy?
  13. Q-Learning and Sarsa were developed in the late 80s. While not state-of-the-art, since they only work for small state and action spaces, they laid the foundation for some of the modern reinforcement learning methods covered in Part 2
  14. ***Learning from experience can be expensive*** It is not necessarily best to use only the last observed reward and new state for our model; Q updates from the sample model get more interesting once more state-action pairs have been observed
  15. add reference
  16. While tabular methods would be memory intensive for large state spaces, the bigger issue is the time it would take to visit all states and observe and update their values - we need the ability to generalize With deep RL, we can have some idea of the value of a state even if we’ve never seen it before
  17. Developed by DeepMind in 2014
  18. Stochastic gradient descent needs iid data A lot of work has been done since 2015 to make these networks even better and more efficient
  19. G is an unbiased estimate of the true Q. The loss function drives the policy towards actions with positive reward and away from actions with negative reward. Major issues: noisy gradients (due to the randomness of samples) and high variance ---> unstable learning and possibly a suboptimal policy
  20. For each learning step, we update the policy net towards actions that the critic says are good, and update the value net to match the change in the actor's policy ---> policy iteration
  21. We can swap out Q for A in our loss function without changing the direction of the gradients, while greatly reducing variance. A2C was introduced by OpenAI; the asynchronous variant (A3C) was developed by DeepMind
  22. DeepMind has reportedly reduced the energy used to cool Google's data centers by about 40%. NLP: Salesforce used RL among other text-generation models to write high-quality summaries of long text. JPMorgan is using an RL agent to execute trades at opportune times. Healthcare: optimization of treatment for patients with chronic disease, deciphering medical images. Improving the output of GANs by making the output adhere to standard rules