Survey of Modern
Reinforcement Learning
Julia Maddalena
What to expect from this talk
Part 1 Introduce the foundations of reinforcement learning
● Definitions and basic ideas
● A couple algorithms that work in simple environments
Part 2 Review some state-of-the-art methods
● Higher level concepts, vanilla methods
● Not a complete list of cutting edge methods
Part 3 Current state of reinforcement learning
Part 1
Foundations of reinforcement learning
What is reinforcement learning?
A type of machine learning where an agent interacts with an environment and learns to take actions that result in greater cumulative reward.
Compared with the other paradigms:
Supervised Learning: X is used to predict Y
● Classification
● Regression
Unsupervised Learning: X alone is analyzed for patterns
● PCA
● Cluster analysis
● Outlier detection
Definitions
Reward
Motivation for the agent. It is not always obvious what the reward signal should be, e.g.:
● YOU WIN! → +1
● GAME OVER → -1
● Stay alive → +1/second (sort of)
Agent
The learner and decision maker
Environment
Everything external to the agent; its state is what the agent uses to make decisions
Actions
The set of possible steps the agent can take
depending on the state of the environment
The Problem with Rewards...
Designing reward functions is notoriously difficult.
Example: in the boat-race game from Clark, Jack, “Faulty Reward Functions in the Wild” (OpenAI, 21 Dec. 2016), a possible reward structure could use total points, time to finish, or finishing position. A human player simply races to the finish; the reinforcement learning agent, rewarded on points, learns to loop endlessly through respawning targets and never finishes the race.
“I’ve taken to imagining deep RL as a demon that’s deliberately misinterpreting your reward and actively searching for the laziest possible local optima.”
- Alex Irpan1
1 Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.
More Definitions
Return
Long-term, discounted reward (the discount factor γ weights rewards that arrive further in the future less heavily)
Value
Expected return
value of states → V(s): how good is it to be in state s
value of state-action pairs → Q(s,a): how good is it to take action a from state s
Policy
How the agent should act from a given state → π(a|s)
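In symbols, these are the standard definitions from Sutton & Barto (cited in the references); γ below denotes the discount factor:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
\qquad
V_\pi(s) = \mathbb{E}_\pi\!\left[\,G_t \mid S_t = s\,\right]
\qquad
Q_\pi(s, a) = \mathbb{E}_\pi\!\left[\,G_t \mid S_t = s,\, A_t = a\,\right]
```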
Markov Decision Process
Markov Process
A random process whose future behavior only depends on the current state.
[Diagram: a three-state Markov process for a dog, with states Sleepy, Energetic, and Hungry and arrows labeled by transition probabilities (15%–70%).]
Markov Decision Process
Markov Process + Actions + Rewards = Markov Decision Process
[Diagram: the same three states (Sleepy, Energetic, Hungry) with actions added (nap, beg, be good); each transition is labeled with a probability and a reward ranging from -6 to +10.]
To model or not to model
Model-based methods
● Transition model: we already know the dynamics of the environment; we simply need to plan our actions to optimize return. (Planning)
● Sample model: we don’t know the dynamics; we try to learn them by exploring the environment and use them to plan our actions to optimize return. (Planning and learning)
Model-free methods
● We don’t know or care about the dynamics; we just want to learn a good policy by exploring the environment. (Learning)
Reinforcement Learning Methods
Model-based Model-free
Transition
Model
Sample
Model
Dynamic
Programming
Bellman Equations
Value of each state under the optimal policy for Robodog (shown on the slide’s diagram).
Bellman Equation: expresses the value of the current state in terms of the policy, the transition probabilities, the reward, the discount factor, and the value of the next state:
V_π(s) = Σ_a π(a|s) Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ V_π(s′) ]
Bellman Optimality Equation: the same relationship, taking the best action rather than following a fixed policy:
V_*(s) = max_a Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ V_*(s′) ]
Policy Iteration
Policy evaluation
Makes the value function consistent with the current policy
Policy improvement
Makes the policy greedy with respect to the current value function
Example (Robodog):
[Diagram: start from an equiprobable policy (nap when sleepy; 50% beg / 50% be good when energetic or hungry); policy improvement replaces it with a deterministic (100% / 0%) policy.]
Values under the initial policy: sleepy 19.88, energetic 20.97, hungry 20.63
Values under the improved policy: sleepy 29.66, energetic 31.66, hungry 31.90
Alternating evaluation and improvement converges to the optimal policy and the value under the optimal policy.
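Below is a minimal sketch of policy iteration on a small Robodog-style MDP. The transition probabilities and rewards in `P` are illustrative placeholders, not the exact numbers from the slide's diagram, and the state and action names are just labels.

```python
# Hypothetical Robodog-style MDP: P[state][action] = [(prob, next_state, reward), ...]
# Illustrative numbers only -- not the exact values from the slide's diagram.
P = {
    "sleepy":    {"nap":     [(0.7, "energetic",  2), (0.3, "sleepy",    -1)]},
    "energetic": {"beg":     [(0.6, "hungry",    10), (0.4, "sleepy",    -1)],
                  "be good": [(0.5, "hungry",     5), (0.5, "energetic", -2)]},
    "hungry":    {"beg":     [(0.6, "sleepy",    -6), (0.4, "hungry",    -1)],
                  "be good": [(0.8, "energetic",  7), (0.2, "hungry",    -2)]},
}
GAMMA, THETA = 0.9, 1e-8
states = list(P)

def q_value(s, a, V):
    """Expected one-step return of taking a in s, bootstrapping from V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def policy_iteration():
    policy = {s: next(iter(P[s])) for s in states}   # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: make V consistent with the current policy
        while True:
            delta = 0.0
            for s in states:
                v_new = q_value(s, policy[s], V)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < THETA:
                break
        # Policy improvement: act greedily with respect to the current V
        stable = True
        for s in states:
            best = max(P[s], key=lambda a: q_value(s, a, V))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V

policy, V = policy_iteration()
print(policy, {s: round(v, 2) for s, v in V.items()})
```

Each outer loop alternates evaluation (the inner sweep) with greedy improvement and stops when the policy no longer changes.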
Reinforcement Learning Methods
Model-based Model-free
Transition
Model On-policy
Off-policy
Sample
Model
Value-based Policy-based
Dynamic
Programming
Sarsa
Q-Learning
Monte Carlo
Temporal Difference
When learning happens
Monte Carlo: wait until end of episode before making updates to value estimates
[Diagram: a tic-tac-toe episode played to the end. Monte Carlo waits for the final outcome, then updates the value of every state visited in the episode.]
Temporal difference, TD(0): update every step using estimates of the next state's value (bootstrapping)
[Diagram: the same episode. TD(0) updates the value of the previous state after each move, long before the outcome is known.]
(In this example, learning = updating the value of states.)
Exploration vs exploitation
𝜀-greedy policy: with probability 𝜀, take a random action (exploration); otherwise take the action with the highest estimated value (exploitation).
(Image credit: Exploration vs Exploitation, Will Evans, slideshare.net/willevans)
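As a minimal sketch (the dictionary `Q` mapping (state, action) pairs to estimated values is a placeholder for whatever value estimates the agent keeps):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit
    (pick the action with the highest estimated value in this state)."""
    if random.random() < epsilon:
        return random.choice(actions)                      # exploration
    return max(actions, key=lambda a: Q[(state, a)])       # exploitation
```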
Sarsa
Initialize Q(s,a):
S          A         Q(S, A)
sleepy     nap       0
energetic  beg       0
energetic  be good   0
hungry     beg       0
hungry     be good   0
For each episode:
• Start in a random state, S.
• Choose action A from S using 𝛆-greedy policy from Q(s,a).
• While S is not terminal:
  1. Take action A, observe reward R and new state S’.
  2. Choose action A’ from S’ using 𝛆-greedy policy from Q(s,a).
  3. Update Q for state S and action A:
     Q(S, A) ← Q(S, A) + α [ R + γ Q(S’, A’) − Q(S, A) ]
  4. S ← S’, A ← A’
Worked example: S = Energetic, A = be good, R = +5, S’ = Hungry, A’ = beg. Starting from Q ≡ 0, the update gives Q(energetic, be good) = 0.5 (consistent with a step size α = 0.1). A minimal code sketch follows this slide.
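A minimal tabular Sarsa sketch, assuming a hypothetical `env` object with `reset()` and `step(action)` returning `(next_state, reward, done)` and a list `actions` of available actions; these names are illustrative, not part of the slides:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1        # step size, discount, exploration rate

def epsilon_greedy(Q, s, actions, eps=EPSILON):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, episodes=1000):
    Q = defaultdict(float)                    # Q(s, a), initialized to 0
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions)
        done = False
        while not done:
            s2, r, done = env.step(a)             # take A, observe R and S'
            a2 = epsilon_greedy(Q, s2, actions)   # choose A' from S' (on-policy)
            # TD update toward the on-policy target R + gamma * Q(S', A')
            Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q
```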
Q-Learning
Initialize Q(s,a):
S          A         Q(S, A)
sleepy     nap       0
energetic  beg       0
energetic  be good   0
hungry     beg       0
hungry     be good   0
For each episode:
• Start in a random state, S.
• While S is not terminal:
  1. Choose action A from S using 𝛆-greedy policy from Q(s,a).
  2. Take action A, observe reward R and new state S’.
  3. Update Q for state S and action A:
     Q(S, A) ← Q(S, A) + α [ R + γ max_a Q(S’, a) − Q(S, A) ]
  4. S ← S’
Worked example: S = Energetic, A = be good, R = +5, S’ = Hungry. The bootstrap term uses the best action available from S’; the slide illustrates this with values Q = −1 (be good) and Q = 2 (beg) for Hungry, so the larger value, 2, would be used in the target regardless of which action is actually taken next.
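Relative to Sarsa, only the bootstrap target changes: Q-learning backs up from the best action in S’ rather than from the action the behavior policy actually picks next. A minimal sketch of just the update (state, action, and reward values below are placeholders):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """Q-learning backup: bootstrap from the best action in S',
    regardless of which action the behavior policy takes next (off-policy)."""
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)
# Transition like the slide's example: Energetic --be good--> Hungry, reward +5
print(q_learning_update(Q, "energetic", "be good", 5, "hungry",
                        ["beg", "be good"]))     # 0.5 when Q is all zeros and alpha = 0.1
```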
Part 2
State-of-the-art methods
Reinforcement Learning Methods
Model-based Model-free
Transition
Model On-policy
Off-policy
Sample
Model
Value-based Policy-based
Dynamic
Programming
Dyna-Q
Sarsa
Q-Learning
Monte Carlo
Temporal Difference
Dyna-Q
Initialize Q(s,a) and Model(s,a):
S          A         Q(S, A)   R   S’
sleepy     nap       0         0   NA
energetic  beg       0         0   NA
energetic  be good   0         0   NA
hungry     beg       0         0   NA
hungry     be good   0         0   NA
For each episode:
• Start in a random state, S.
• While S is not terminal:
  1. Choose action A from S using 𝛆-greedy policy from Q(s,a).
  2. Take action A, observe reward R and new state S’.
  3. Update Q for state S and action A (ordinary Q-Learning).
  4. Update Model for state S and action A: Model(S, A) ← (R, S’).
  5. “Hallucinate” n transitions from the Model and use them to update Q.
Worked example: after observing (Energetic, be good) → Hungry, +5 and (Hungry, beg) → Sleepy, −6, repeatedly replaying the first transition from the model drives Q(energetic, be good) from 0.5 to 0.95, 1.355, 1.7195, ... without taking another real step (consistent with a step size of 0.1). A sketch of the planning loop follows this slide.
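A minimal sketch of the Dyna-Q loop, again assuming a hypothetical `env` with `reset()`/`step(action)` returning `(next_state, reward, done)` and a list `actions`; the number of planning ("hallucinated") updates per real step is an arbitrary choice here:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON, N_PLANNING = 0.1, 0.9, 0.1, 10

def q_update(Q, s, a, r, s2, actions):
    """Standard Q-learning backup, used for both real and hallucinated steps."""
    Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, a2)] for a2 in actions) - Q[(s, a)])

def dyna_q(env, actions, episodes=500):
    Q = defaultdict(float)
    model = {}                                   # Model(S, A) -> (R, S')
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = (random.choice(actions) if random.random() < EPSILON
                 else max(actions, key=lambda a2: Q[(s, a2)]))
            s2, r, done = env.step(a)            # real experience
            q_update(Q, s, a, r, s2, actions)    # direct RL (ordinary Q-learning)
            model[(s, a)] = (r, s2)              # learn the sample model
            for _ in range(N_PLANNING):          # "hallucinate" n transitions
                ps, pa = random.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                q_update(Q, ps, pa, pr, ps2, actions)
            s = s2
    return Q
```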
Deep Reinforcement Learning
[Diagram: a gridworld with states 1-8. A tabular method must store Q(s, a) for every state-action pair (here 8 states × 3 actions X, Y, Z = 24 entries). Deep RL replaces the table with a neural network, a "black box" that takes the state s as input and outputs Q(s, X), Q(s, Y), Q(s, Z), so value estimates can generalize to states the agent has never visited.]
Reinforcement Learning Methods
Model-based Model-free
Transition
Model On-policy
Off-policy
Sample
Model
Value-based Policy-based
Dynamic
Programming
Dyna-Q
Monte Carlo
Tree Search
Sarsa
Q-Learning
Deep Q Networks*
Monte Carlo Methods
Temporal Difference Methods
* Utilize deep learning
Deep Q Networks (DQN)
[Diagram: a tic-tac-toe board is the state s; a neural network ("black box") maps it to Q(s, a) for each action a, i.e. how good is it to take this action from this state?]
1. Initialize network.
2. Take one action under the Q policy.
3. Add the new information to the training data: a growing table of observed transitions (sᵢ, aᵢ, rᵢ, sᵢ₊₁) for i = 1 ... t.
4. Use stochastic gradient descent to update the weights based on the squared error between the prediction ŷ = Q(s, a) and the target y = r + γ max_a’ Q(s’, a’).
Repeat steps 2 - 4 until convergence
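As a sketch of what the "black box" could look like in code, here is a tiny fully connected Q-network and a single gradient step on the squared TD error, written with PyTorch; the layer sizes, learning rate, and placeholder transition are illustrative assumptions, not details from the talk:

```python
import torch
import torch.nn as nn

N_STATE_FEATURES, N_ACTIONS = 9, 9   # e.g. a flattened tic-tac-toe board; sizes are illustrative

# The "black box": a small network mapping a state to Q(s, a) for every action at once.
q_net = nn.Sequential(
    nn.Linear(N_STATE_FEATURES, 64),
    nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_target(reward, next_state, done, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'), or just r at a terminal state."""
    if done:
        return torch.tensor(reward)
    with torch.no_grad():                         # don't backpropagate through the target
        return reward + gamma * q_net(next_state).max()

# One gradient step on (y - y_hat)^2 for a single placeholder transition (s, a, r, s').
s, a, r, s2, done = torch.zeros(N_STATE_FEATURES), 0, 1.0, torch.zeros(N_STATE_FEATURES), False
loss = (td_target(r, s2, done) - q_net(s)[a]) ** 2
optimizer.zero_grad()
loss.backward()
optimizer.step()
```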
Deep Q Networks (DQN)
1. Initialize network.
2. Take one action under the Q policy.
3. Add the new information to the training data.
4. Use stochastic gradient descent to update the weights.
Problem:
● Data are not i.i.d.
● Data are collected under an evolving policy, not the optimal policy that we are trying to learn.
Solution:
Keep a replay buffer of the k most recent transitions (rather than the full history from step 1 to t) and train on small random samples from it.
Problem:
Instability is introduced when updating Q(s, a) using Q(s’, a’) from the same network.
Solution:
Keep a secondary target network that is used to evaluate Q(s’, a’), and only sync it with the primary network after every n training iterations.
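A sketch of how the two fixes fit together: a bounded replay buffer sampled at random, and a target network that is only synced every `SYNC_EVERY` steps. PyTorch again; all sizes and hyperparameters are illustrative assumptions.

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn

N_STATE_FEATURES, N_ACTIONS = 9, 9                        # illustrative sizes
GAMMA, BATCH_SIZE, BUFFER_SIZE, SYNC_EVERY = 0.99, 32, 10_000, 1_000

q_net = nn.Sequential(nn.Linear(N_STATE_FEATURES, 64), nn.ReLU(),
                      nn.Linear(64, N_ACTIONS))           # primary network
target_net = copy.deepcopy(q_net)                         # target network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Fix 1: a replay buffer of size k; the acting loop appends tuples
# (state, action, reward, next_state, done) as it goes.
replay = deque(maxlen=BUFFER_SIZE)

def train_step(step):
    """One SGD step on a random minibatch, bootstrapping from the target net."""
    if len(replay) < BATCH_SIZE:
        return
    batch = random.sample(replay, BATCH_SIZE)             # small, roughly i.i.d. sample
    s    = torch.stack([b[0] for b in batch])
    a    = torch.tensor([b[1] for b in batch])
    r    = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s2   = torch.stack([b[3] for b in batch])
    done = torch.tensor([b[4] for b in batch], dtype=torch.float32)
    with torch.no_grad():                                 # Fix 2: evaluate Q(s', a') with the target net
        y = r + GAMMA * target_net(s2).max(dim=1).values * (1.0 - done)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % SYNC_EVERY == 0:                            # sync target net every n steps
        target_net.load_state_dict(q_net.state_dict())
```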
Reinforcement Learning Methods
Model-based Model-free
Transition
Model On-policy
Off-policy
Sample
Model
Value-based Policy-based
Dynamic
Programming
Dyna-Q
Monte Carlo
Tree Search
Sarsa
Q-Learning
Deep Q Networks*
REINFORCE*
Monte Carlo Methods
Temporal Difference Methods
* Utilize deep learning
REINFORCE
[Diagram: the same tic-tac-toe state s goes into a neural network ("black box"), which now outputs 𝛑(a|s) for each action a: what is the probability of taking this action under policy 𝛑?]
1. Initialize network.
2. Play out a full episode under 𝛑, collecting rewards r1, r2, ..., r8.
3. For every step t, calculate the return from that state until the end of the episode:
   G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ...
4. Use stochastic gradient descent to update the weights based on the policy-gradient loss
   L = − Σ_t G_t log 𝛑(a_t | s_t),
   which pushes the policy toward actions that were followed by high return.
Repeat steps 2 - 4 until convergence
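A sketch of one REINFORCE update from a completed episode, using PyTorch; the network shape and learning rate are illustrative, and the episode data are assumed to arrive as parallel lists.

```python
import torch
import torch.nn as nn

N_STATE_FEATURES, N_ACTIONS = 9, 9                        # illustrative sizes
policy_net = nn.Sequential(nn.Linear(N_STATE_FEATURES, 64), nn.ReLU(),
                           nn.Linear(64, N_ACTIONS))      # logits for pi(a|s)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards, gamma=0.99):
    """One Monte Carlo policy-gradient step from a full episode.

    states: list of state tensors; actions: list of ints; rewards: list of floats.
    """
    # Step 3: return from each time step until the end of the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # Step 4: loss = -sum_t G_t * log pi(a_t | s_t).  Gradient descent on this
    # pushes the policy toward actions that were followed by high return.
    logits = policy_net(torch.stack(states))
    log_probs = torch.log_softmax(logits, dim=1)
    chosen = log_probs[torch.arange(len(actions)), torch.tensor(actions)]
    loss = -(returns * chosen).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```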
DQN vs REINFORCE
                 DQN                       REINFORCE
Learning         Off-policy                On-policy
Updates          Temporal difference       Monte Carlo
Output           Q(s,a) ➝ Value-based      𝛑(a|s) ➝ Policy-based
Action spaces    Small discrete only       Large discrete or continuous
Exploration      𝛆-greedy                  Built-in due to stochastic policy
Convergence      Slower to converge        Faster to converge
Experience       Less experience needed    More experience needed
Reinforcement Learning Methods
Model-based Model-free
Transition
Model On-policy
Off-policy
Sample
Model
Value-based Policy-based
Dynamic
Programming
Dyna-Q
Monte Carlo
Tree Search
Sarsa
Q-Learning
Deep Q Networks*
REINFORCE*
Monte Carlo Methods
Temporal Difference Methods
* Utilize deep learning
Advantage Actor-Critic*
Q Actor-Critic
[Diagram: the state s feeds shared "common layers", which split into a policy net outputting 𝛑(a|s) for each action a and a value net outputting Q(s,a) for each action a.]
Actor
Policy-based like REINFORCE, but can now use temporal difference learning
Critic
Value-based; works sort of like DQN
Quick review
Q-Learning → DQN: ability to generalize values across the state space
DQN → REINFORCE: ability to control in continuous action spaces using a stochastic policy
REINFORCE → Q Actor-Critic: one-step updates
Q Actor-Critic → A2C: reduced variability in gradients
Advantage vs action value
A(S, A) = Q(S, A) − V(S)
The advantage of an action is how much better its action value Q(S, A) is than the overall value V(S) of the state, i.e. how much better this action is than acting according to the current policy from that state.
Advantage Actor-Critic (A2C)
[Diagram: the state s feeds shared "common layers", which split into a policy net outputting 𝛑(a|s) for each action a and a value net outputting V(s).]
Actor
Policy-based like REINFORCE. Can now use temporal difference learning and a baseline: the return in the loss is replaced by the advantage, estimated as A ≈ r + γ V(s’) − V(s), which reduces the variance of the gradients without changing their direction.
Critic
Value-based; now learns the value of states instead of state-action pairs
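A sketch of a shared-trunk actor-critic with a one-step advantage estimate, in PyTorch; the sizes, learning rate, and single-transition (unbatched) update are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

N_STATE_FEATURES, N_ACTIONS = 9, 9               # illustrative sizes

class ActorCritic(nn.Module):
    """Shared 'common layers' feeding a policy head pi(a|s) and a value head V(s)."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(N_STATE_FEATURES, 64), nn.ReLU())
        self.policy_head = nn.Linear(64, N_ACTIONS)   # logits for pi(a|s)
        self.value_head = nn.Linear(64, 1)            # V(s)

    def forward(self, s):
        h = self.trunk(s)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

net = ActorCritic()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

def a2c_update(s, a, r, s2, done, gamma=0.99):
    """One-step actor-critic update on a single transition (batching omitted)."""
    logits, v = net(s)
    with torch.no_grad():
        _, v2 = net(s2)
        target = r + gamma * v2 * (0.0 if done else 1.0)
    advantage = target - v                        # A ~ r + gamma V(s') - V(s)
    log_pi = torch.log_softmax(logits, dim=-1)[a]
    actor_loss = -advantage.detach() * log_pi     # push pi toward advantageous actions
    critic_loss = advantage.pow(2)                # fit V(s) to the TD target
    loss = actor_loss + critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```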
Part 3
Current state of reinforcement learning
Current state of reinforcement learning
Mostly in academia or research-focused companies, e.g. DeepMind, OpenAI
● Most impressive progress has been made in games
1 Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.
Barriers to entry:
● Too much real-world experience is required. Driverless cars, robotics, etc. are still largely not using RL.
  “The rule-of-thumb is that except in rare cases, domain-specific algorithms work faster and better than reinforcement learning.”1
  “Reinforcement learning is a type of machine learning whose hunger for data is even greater than supervised learning. It is really difficult to get enough data for reinforcement learning algorithms. There’s more work to be done to translate this to businesses and practice.”
  - Andrew Ng
● Simulation is often not realistic enough
● Poor convergence properties
● There has not been enough development in transfer learning for RL models
  ○ Models do not generalize well outside of what they are trained on
Promising applications of RL (that aren’t games)
● Energy
● Finance
● Healthcare
● Some aspects of robotics
● NLP
● Computer systems
● Traffic light control
● Assisting GANs
● Neural network architecture
● Computer vision
● Education
● Recommendation systems
● Science & Math
References
Clark, Jack. “Faulty Reward Functions in the Wild.” OpenAI, 21 Dec. 2016.
Fridman, Lex. MIT: Introduction to Deep Reinforcement Learning. https://www.youtube.com/watch?v=zR11FLZ-O9M
Fullstack Academy (2017). Monte Carlo Tree Search Tutorial. https://www.youtube.com/watch?v=Fbs4lnGLS8M
Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.
Lapan, M. (2018). Deep Reinforcement Learning Hands-On: Apply modern RL methods, with deep Q-networks, value iteration, policy gradients, TRPO, AlphaGo Zero and more. Birmingham, UK: Packt Publishing.
Silver, David (2015). University College London Reinforcement Learning Course. Lecture 7: Policy Gradient Methods.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press.
Towards Data Science. “Applications of Reinforcement Learning in Real World.” 1 Aug 2018.

More Related Content

What's hot

Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement LearningUsman Qayyum
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learningJie-Han Chen
 
TADPole_Nurjahan Begum
TADPole_Nurjahan BegumTADPole_Nurjahan Begum
TADPole_Nurjahan BegumNurjahan Begum
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learningSubrat Panda, PhD
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement LearningNatan Katz
 
Financial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning ApproachFinancial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning Approach謙益 黃
 
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...Simplilearn
 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialOmar Enayet
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningDongHyun Kwak
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)pauldix
 

What's hot (13)

Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learning
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
TADPole_Nurjahan Begum
TADPole_Nurjahan BegumTADPole_Nurjahan Begum
TADPole_Nurjahan Begum
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learning
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement Learning
 
Financial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning ApproachFinancial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning Approach
 
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners Tutorial
 
Deep Q-Learning
Deep Q-LearningDeep Q-Learning
Deep Q-Learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)
 

Similar to Survey of Modern Reinforcement Learning

Demystifying deep reinforement learning
Demystifying deep reinforement learningDemystifying deep reinforement learning
Demystifying deep reinforement learning재연 윤
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningElias Hasnat
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-LearningKuppusamy P
 
24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptxManiMaran230751
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017MLconf
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDing Li
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningSalem-Kabbani
 
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based accelerationHye-min Ahn
 
Adaptive High-Level Strategy Learning in StarCraft
Adaptive High-Level Strategy Learning in StarCraftAdaptive High-Level Strategy Learning in StarCraft
Adaptive High-Level Strategy Learning in StarCraftJiéverson Maissiat
 
TensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksTensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksBen Ball
 
week10_Reinforce.pdf
week10_Reinforce.pdfweek10_Reinforce.pdf
week10_Reinforce.pdfYuChianWu
 
Intro to Reinforcement Learning
Intro to Reinforcement LearningIntro to Reinforcement Learning
Intro to Reinforcement LearningUtkarsh Garg
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperDataScienceLab
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningNAVER Engineering
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning艾鍗科技
 
14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptxRithikRaj25
 

Similar to Survey of Modern Reinforcement Learning (20)

Demystifying deep reinforement learning
Demystifying deep reinforement learningDemystifying deep reinforement learning
Demystifying deep reinforement learning
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-Learning
 
Introduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement LearningIntroduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement Learning
 
24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Deep Q-learning explained
Deep Q-learning explainedDeep Q-learning explained
Deep Q-learning explained
 
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
 
Cs221 rl
Cs221 rlCs221 rl
Cs221 rl
 
Adaptive High-Level Strategy Learning in StarCraft
Adaptive High-Level Strategy Learning in StarCraftAdaptive High-Level Strategy Learning in StarCraft
Adaptive High-Level Strategy Learning in StarCraft
 
TensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksTensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and Tricks
 
week10_Reinforce.pdf
week10_Reinforce.pdfweek10_Reinforce.pdf
week10_Reinforce.pdf
 
Intro to Reinforcement Learning
Intro to Reinforcement LearningIntro to Reinforcement Learning
Intro to Reinforcement Learning
 
Deep RL.pdf
Deep RL.pdfDeep RL.pdf
Deep RL.pdf
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine Sweeper
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx
 

Recently uploaded

Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 

Survey of Modern Reinforcement Learning

  • 1. Survey of Modern Reinforcement Learning Julia Maddalena
  • 2. What to expect from this talk Part 1 Introduce the foundations of reinforcement learning ● Definitions and basic ideas ● A couple algorithms that work in simple environments Part 2 Review some state-of-the-art methods ● Higher level concepts, vanilla methods ● Not a complete list of cutting edge methods Part 3 Current state of reinforcement learning
  • 3. Part 1 Foundations of reinforcement learning
  • 4. What is reinforcement learning? A type of machine learning where an agent interacts with an environment and learns to take actions that result in greater cumulative reward. X alone is analyzed for patterns ● PCA ● Cluster analysis ● Outlier detection X is used to predict Y ● Classification ● Regression Supervised Learning Unsupervised Learning Reinforcement Learning
  • 5. Definitions Reward Motivation for the agent. Not always obvious what the reward signal should be YOU WIN! +1 GAME OVER -1 Stay alive +1/second (sort of) Agent The learner and decision maker Environment Everything external to the agent used to make decisions Actions The set of possible steps the agent can take depending on the state of the environment
  • 6. The Problem with Rewards... Designing reward functions is notoriously difficult Clark, Jack. “Faulty Reward Functions in the Wild.” OpenAI, 21 Dec. 2016. 1 Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018. Possible reward structure ● Total points ● Time to finish ● Finishing position Human player “I’ve taken to imagining deep RL as a demon that’s deliberately misinterpreting your reward and actively searching for the laziest possible local optima.” - Alex Irpan Reinforcement Learning Agent
  • 7. More Definitions Return Long-term, discounted reward Value Expected return value of states → V(s) how good is it to be in state s value of state-action pairs → Q(s,a) how good is it to take action a from state s discount factor Policy How the agent should act from a given state → π(a|s)
  • 8. Markov Decision Process Markov Process A random process whose future behavior only depends on the current state. Sleepy Energetic Hungry 70% 50% 15% 70% 50% 35% 50%
  • 9. Markov Decision Process Sleepy Energetic Hungry nap beg be good beg be good 30% 70% 20% 60% 20% 50% 60% 40% 60% 40% 10% 40% Markov Process + Actions + Reward = Markov Decision Process +2 -1 -2 -1 +10 +7 -2 -1 +10 +5 -6 -4
  • 10. To model or not to model Model-based methods Transition Model ● We already know the dynamics of the environment ● We simply need to plan our actions to optimize return Model-free methods We don’t know or care about the dynamics, we just want to learn a good policy by exploring the environment Sample Model ● We don’t know the dynamics ● We try to learn them by exploring the environment and use them to plan our actions to optimize return Planning Learning Planning and Learning
  • 11. Reinforcement Learning Methods Model-based Model-free Transition Model Sample Model Dynamic Programming
  • 12. Bellman Equations Value of each states under optimal policy for Robodog: Bellman Equation Bellman Optimality Equation policy transition probabilities value of the next state reward discount factor value of the current state
  • 13. Policy Iteration Policy evaluation Makes the value function consistent with the current policy Policy improvement Make the policy greedy with respect to the current value function be good beg be good beg 100% 50% 50% 50% 50% sleepy nap energetic hungry be good beg be good beg 100% 0% 100% 100% 0% sleepy nap energetic hungry state value sleepy 19.88 energetic 20.97 hungry 20.63 state value sleepy 29.66 energetic 31.66 hungry 31.90 Converge to optimal policy and value under optimal policy
  • 14. Reinforcement Learning Methods Model-based Model-free Transition Model On-policy Off-policy Sample Model Value-based Policy-based Dynamic Programming Sarsa Q-Learning Monte Carlo Temporal Difference
  • 15. When learning happens Monte Carlo: wait until end of episode before making updates to value estimates X X O X O X X O O X X O O X X X O O O X X X O O O X X X Update value for all states in episode X X O X O X X O O X Update value for previous state Temporal difference, TD(0): update every step using estimates of next states bootstrapping Update value for previous state Update value for previous state Update value for previous state . . . in this example, learning = updating value of states
  • 16. Exploration vs exploitation 𝜀-greedy policy Exploration vs Exploitation, Will Evans, slideshare.net/willevans exploitation exploration
  • 17. Sarsa S A Q(S, A) sleepy nap 0 energetic beg 0 energetic be good 0 hungry beg 0 hungry be good 0 Energetic be good Hungry +5 beg S A R S’ A’ Hungry beg Initialize Q(s,a) For each episode: • Start in a random state, S. • Choose action A from S using 𝛆-greedy policy from Q(s,a). • While S is not terminal: 1. Take action A, observe reward R and new state S’. 2. Choose action A’ from S’ using 𝛆-greedy policy from Q(s,a). 3. Update Q for state S and action A: 4. S ← S’, A ← A’ 0.5
  • 18. Q-Learning S A Q(S, A) sleepy nap 0 energetic beg 0 energetic be good 0 hungry beg 0 hungry be good 0 Energetic be good Hungry +5 beg S A R S’ 3. Update Q for state S and action A: be good Q = -1 Q = 2 Hungry Initialize Q(s,a) For each episode: • Start in a random state, S. • While S is not terminal: 1.Choose action A from S using 𝛆-greedy policy from Q(s,a). 2.Take action A, observe reward R and new state S’. 4. S ← S’ beg 0.5
  • 20. Reinforcement Learning Methods Model-based Model-free Transition Model On-policy Off-policy Sample Model Value-based Policy-based Dynamic Programming Dyna-Q Sarsa Q-Learning Monte Carlo Temporal Difference
  • 21. Dyna-Q For each episode: • Start in a random state, S. • While S is not terminal: 1.Choose action A from S using 𝛆-greedy policy from Q(s,a). 2.Take action A, observe reward R and new state S’. 3.Update Q for state S and action A: Model(S, A) S A Q(S, A) R S’ sleepy nap 0 0 NA energetic beg 0 0 NA energetic be good 0 0 NA hungry beg 0 0 NA hungry be good 0 0 NA Energetic be good Hungry +5 ordinary Q-Learning Hungry beg Sleepy -6 R R R ⋮ 5 hungry0.5 R R R ⋮ Initialize Q(s,a) and Model(s,a) 4. Update Model for state S and action A: 5. “Hallucinate” n transitions and use them to update Q: 0.951.3551.7195
  • 23. Deep Reinforcement Learning 2 3 4 5 6 7 8 Black Box state, s Q(s, a) for each action a 1 s a Q(s,a) 1 X Q(1, X) 1 Y Q(1, Y) 1 Z Q(1, Z) 2 X Q(2, X) 2 Y Q(2, Y) 2 Z Q(2, Z) 3 X Q(3, X) 3 Y Q(3, Y) 3 Z Q(3, Z) 4 X Q(4, X) 4 Y Q(4, Y) 4 Z Q(4, Z) 5 X Q(5, X) 5 Y Q(5, Y) 5 Z Q(5, Z) 6 X Q(6, X) 6 Y Q(6, Y) 6 Z Q(6, Z) 7 X Q(7, X) 7 Y Q(7, Y) 7 Z Q(7, Z) 8 X Q(8, X) 8 Y Q(8, Y) 8 Z Q(8, Z) Q(s,X) Q(s,Y) Q(s,Z)
  • 24. Reinforcement Learning Methods Model-based Model-free Transition Model On-policy Off-policy Sample Model Value-based Policy-based Dynamic Programming Dyna-Q Monte Carlo Tree Search Sarsa Q-Learning Deep Q Networks* Monte Carlo Methods Temporal Difference Methods * Utilize deep learning
  • 25. Deep Q Networks (DQN) Black Box X blank O state, s Q(s, a) for each action a X X O X how good is it to take this action from this state? 1. Initialize network. 1. Take one action under Q policy. s a r s’ 1 s1 a1 r1 s2 2 s2 a2 r2 s3 ... ... ... ... t st at rt st 3. Add new information to training data: 4. Use stochastic gradient descent to update weights based on: Repeat steps 2 - 4 until convergence ŷ y
  • 26. Deep Q Networks (DQN) 1. Initialize network. 2. Take one action under Q policy. 3. Add new information to training data: 1. Use stochastic gradient descent to update weights based on: Problem: ● Data not i.i.d. ● Data collected based on an evolving policy, not the optimal policy that we are trying to learn. Solution: Create a replay buffer of size k to take small samples from Problem: Instability introduced when updating Q(s, a) using Q(s’, a’) Solution: Have a secondary target network used to evaluate Q(s’, a’) and only sync with primary network after every n training iterations primary network target network s a r s’ 1 s1 a1 r1 s2 2 s2 a2 r2 s3 ... ... ... ... t st at rt st s a r s’ 1 s1 a1 r1 s2 2 s2 a2 r2 s3 ... ... ... ... t - k st-k at-k rt-k st-k+1 ... ... ... ... t st at rt st+1
  • 27. Reinforcement Learning Methods Model-based Model-free Transition Model On-policy Off-policy Sample Model Value-based Policy-based Dynamic Programming Dyna-Q Monte Carlo Tree Search Sarsa Q-Learning Deep Q Networks* REINFORCE* Monte Carlo Methods Temporal Difference Methods * Utilize deep learning
  • 28. REINFORCE Black Box X blank O state, s 𝛑(a|s) for each action a X X O X what is the probability of taking this action under policy 𝛑? 1. Initialize network. r1 r2 r3 r4 r5 r6 r7 r8 2. Play out a full episode under 𝛑. 3. For every step t, calculate return from that state until the end: 4. Use stochastic gradient descent to update weights based on: Repeat steps 2 - 4 until convergence
  • 29. DQN vs REINFORCE DQN REINFORCE Learning Off-policy On-policy Updates Temporal difference Monte Carlo Output Q(s,a) ➝ Value-based 𝛑(a|s) ➝ Policy-based Action spaces Small discrete only Large discrete or continuous Exploration 𝛆-greedy Built-in due to stochastic policy Convergence Slower to converge Faster to converge Experience Less experience needed More experience needed
  • 30. Reinforcement Learning Methods Model-based Model-free Transition Model On-policy Off-policy Sample Model Value-based Policy-based Dynamic Programming Dyna-Q Monte Carlo Tree Search Sarsa Q-Learning Deep Q Networks* REINFORCE* Monte Carlo Methods Temporal Difference Methods * Utilize deep learning Advantage Actor-Critic*
  • 31. Q Actor-Critic Common layers X blank O state, s Policy net Value net 𝛑(a|s) for each action a Q(s,a) for each action a Actor Policy-based like REINFORCE but can now use temporal difference learning Critic Value-based, works sort of like DQN
  • 32. Quick review Q-Learning DQN REINFORCE Q Actor-Critic A2C Ability to generalize values in state space Ability to control in continuous action spaces using stochastic policy One step updates Reduce variability in gradients
  • 33. Advantage vs action value A(S, A) Q(S, A) V(S) Q(S, A) V(S) advantage
  • 34. Advantage Actor-Critic (A2C) Common layers X blank O state, s Policy net Value net 𝛑(a|s) for each action a V(s) Actor Policy-based like REINFORCE Can now use temporal difference learning and baseline: Critic Value-based, now learns value of states instead of state-action pairs
  • 35. Part 3 Current state of reinforcement learning
  • 36. Current state of reinforcement learning Mostly in academia or research-focused companies, e.g. DeepMind, OpenAI ● Most impressive progress has been made in games 1 Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018. Barriers to entry: ● Too much real-world experience required Driverless car, robotics, etc. still largely not using RL. “The rule-of-thumb is that except in rare cases, domain-specific algorithms work faster and better than reinforcement learning.”1 “Reinforcement learning is a type of machine learning whose hunger for data is even greater than supervised learning. It is really difficult to get enough data for reinforcement learning algorithms. There’s more work to be done to translate this to businesses and practice.” - Andrew Ng ● Simulation is often not realistic enough ● Poor convergence properties ● There has not been enough development in transfer learning for RL models ○ Models do not generalize well outside of what they are trained on
  • 37. Promising applications of RL (that aren’t games) Energy Finance Healthcare Some aspects of robotics NLP Computer systems Traffic light control Assisting GANs Neural network architecture Computer vision Education Recommendation systems Science & Math
  • 38. References Clark, Jack. “Faulty Reward Functions in the Wild.” OpenAI, 21 Dec. 2016. Friedman, Lex (2015). MIT: Introduction to Deep Reinforcement Learning. https://www.youtube.com/watch?v=zR11FLZ-O9M Fullstack Academy (2017). Monte Carlo Tree Search Tutorial. https://www.youtube.com/watch?v=Fbs4lnGLS8M Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018. Lapan, M. (2018). Deep reinforcement learning hands-on: Apply modern RL methods, with deep Q-networks, value iteration, policy gradients, TRPO, AlphaGo Zero and more. Birmingham, UK: Packt Publishing. Silver, David (2015). University College London Reinforcement Learning Course. Lecture 7: Policy Gradient Methods Towards Data Science. “Applications of Reinforcement Learning in Real World”, 1 Aug 2018. Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. Cambridge, MA: The MIT Press.

Editor's Notes

  1. Environment → states of the environment
  2. Discount factor prevents infinite return. Value- vs policy-based methods
  3. drama/sci-fi
  4. State-action pair Dynamics
  5. Sample model - by learning and planning we are often able to do better than we would with just learning alone
  6. Only applies to single agent fully observable MDPs
  7. Reward can propagate backwards
  8. Almost all reinforcement learning methods are well described as generalized policy iteration
  9. Monte Carlo = low bias, high variance Temporal difference methods = higher bias, lower variance (and don’t need complete episodes in order to learn) Lower variance is often better!
  10. Major consideration in all RL algorithms Greedy action = action that we currently believe has the most value Decrease epsilon over time
  11. Now, we no longer know the dynamics of Robodog
  12. What is the advantage/disadvantage of off-policy vs on-policy?
  13. Q-learning and Sarsa were developed in the late 80s. While not state-of-the-art, as they only work for small state and action spaces, they laid the foundation for some of the modern reinforcement learning methods covered in Part 2
  14. ***Learning from experience can be expensive*** It is not necessarily the best scheme to store only the last observed reward and new state in our model. Q updates from the sample model get more interesting once more state-action pairs have been observed
  15. add reference
  16. While tabular methods would be memory intensive for large state spaces, the bigger issue is the time it would take to visit all states and observe and update their values - we need the ability to generalize With deep RL, we can have some idea of the value of a state even if we’ve never seen it before
  17. Developed by DeepMind in 2014
  18. Stochastic gradient descent needs iid data A lot of work has been done since 2015 to make these networks even better and more efficient
  19. G is an unbiased estimate of the true Q Loss function drives policy towards actions with positive reward and away from actions with negative reward Major issues: noisy gradients (due to randomness of samples), high variance ---> unstable learning and possibly suboptimal policy
  20. for each learning step, we upgrade policy net towards actions that the critic says are good, and update the value net to match the change in the actor’s policy ---> policy iteration
  21. We can swap Q out for A in our loss function without changing the direction of the gradients, while greatly reducing variance. A2C was introduced by OpenAI, and the asynchronous version (A3C) was developed by DeepMind
  22. DeepMind has supposedly reduced Google’s energy consumption by 50% NLP: SalesForce used RL among other text generation models to write high quality summaries of long text. JPMorgan using RL robot to execute trades at opportune times Healthcare - optimization of treatment for patients with chronic disease, deciphering medical images Improving output of GANs by making output adhere to standard rules