Reinforcement Learning
Örjan Ekeberg, Machine Learning
1 Defining the Problem
Reward Maximization
Simplifying Assumptions
Bellman’s Equation
2 Learning
Monte-Carlo Method
Temporal-Difference
Learning to Act
Q-Learning
3 Improvements
Importance of Making Mistakes
Eligibility Trace
Reinforcement Learning
Learning of a behavior without explicit information about correct actions
A reward gives information about success
Agent–Environment Abstraction:
[Figure: the agent sends an Action to the environment; the environment returns a State and a Reward to the agent]
Policy — Choice of action, depending on current state
Learning Objective:
Develop a policy which maximizes the reward
... total reward over the agent's lifetime!
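A minimal sketch of this interaction loop in Python (the toy environment, its dynamics, and the ten-step "lifetime" are illustrative assumptions, not part of the slides):

```python
import random

class Environment:
    """Toy two-state environment; dynamics and rewards are invented for illustration."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Next state and reward depend only on the current state, the action, and chance.
        self.state = (self.state + action) % 2
        reward = 1.0 if self.state == 1 else -1.0
        return self.state, reward

def policy(state):
    # Choice of action depending on the current state (here simply random).
    return random.choice([0, 1])

env = Environment()
state = env.reset()
total_reward = 0.0
for t in range(10):                    # the agent's lifetime: ten steps in this sketch
    action = policy(state)             # policy: state -> action
    state, reward = env.step(action)   # environment returns next state and reward
    total_reward += reward             # the learning objective is to maximize this sum
print(total_reward)
```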
Credit Assignment Problems
The reward does not necessarily arrive when you do something good
Temporal credit assignment
The reward does not say what was good
Structural credit assignment
Consider a minimal behavior selection situation:
What is the best action to take?
[Figure: a small decision tree of states and actions, with immediate rewards such as −1, 2, 1, 0, and 3 attached to the transitions]
Reward — Immediate consequence of our decision
Planning — Taking future reward into account
Optimal Behavior — Maximize total reward
Can the agent make the right decision without explicitly modeling
the future?
Yes! If the Sum of Future Rewards for each situation is known.
[Figure: the same decision tree, now annotated with each state's sum of future rewards (values such as 8, 4, 2, −2, −4)]
Call this the State Value
State Values are subjective
They depend on the agent's own behavior
Must be re-adjusted as the agent's behavior improves
Planning Horizon — How long is the future?
[Figure: two reward-versus-time plots contrasting an early reward with a postponed one]
Infinite future allows for infinite postponing
Discounting — Make early reward more valuable
Discount factor (γ) — Time scale of planning
Finite Horizon:
max Σ_{t=0}^{h} r_t
Infinite Horizon:
max Σ_{t=0}^{∞} γ^t · r_t
Requires discount of future reward (0 < γ < 1)
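As a concrete illustration of the infinite-horizon objective (the reward sequence and the value of γ are invented numbers), the discounted return can be computed directly:

```python
def discounted_return(rewards, gamma):
    # sum over t of gamma^t * r_t, truncated to the given finite sequence
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [-1, -1, -1, 10]                      # hypothetical reward sequence
print(discounted_return(rewards, gamma=0.9))    # 10*0.9^3 - (1 + 0.9 + 0.81) ≈ 4.58
```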
The Reward function controls which task should be solved
Game (Chess, Backgammon)
Reward only at the end: +1 when winning, −1 when losing
Avoiding mistakes (cycling, balancing, ...)
Reward −1 at the end (when failing)
Find a short/fast/cheap path to a goal
Reward −1 at each step
Simplifying Assumptions
Discrete time
Finite number of actions ai
ai ∈ {a1, a2, a3, . . . , an}
Finite number of states si
si ∈ {s1, s2, s3, . . . , sm}
Environment is a stationary Markov Decision Process
Reward and next state depend only on s, a, and chance
Classical model problem: Grid World
Each state is represented by a position in a grid
The agent acts by moving to other positions
[Figure: a trivial labyrinth, a small grid with goal squares marked G]
Reward: −1 at each step until a goal state (G) is reached
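A minimal sketch of such a grid world, assuming a 4×4 grid with a single goal cell in a corner (the slide's exact layout and goal positions are not recoverable from the figure):

```python
class GridWorld:
    """4x4 grid; states are (row, col); reward -1 per step until the goal is reached."""
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, size=4, goal=(0, 0)):
        self.size, self.goal = size, goal

    def step(self, state, action):
        """Deterministic delta(s, a) and r(s, a): move within the grid, -1 per step."""
        if state == self.goal:
            return state, 0.0                            # the goal state is absorbing
        dr, dc = self.ACTIONS[action]
        row = min(max(state[0] + dr, 0), self.size - 1)  # bump into the walls
        col = min(max(state[1] + dc, 0), self.size - 1)
        return (row, col), -1.0

env = GridWorld()
print(env.step((1, 0), "up"))   # ((0, 0), -1.0): one step into the goal
```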
The value of a state depends on the current policy.
[Figure: state values V in the grid world under an optimal policy (values from 0 down to −3) and under a random policy (values from 0 down to about −22)]
The Agent's Internal Representation
Policy
The action chosen by the agent for each state
π(s) → a
Value Function
Expected total future reward from s when following policy π
V^π(s) → expected return
Representation of the Environment
Where does an action take us?
δ(s, a) → s′
How much reward do we receive?
r(s, a) → r
The values of different states are interrelated
Bellman's Equation:
V^π(s) = r(s, π(s)) + γ · V^π(δ(s, π(s)))
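The slide gives the equation itself; when δ and r are known, one standard way to use it is to sweep it repeatedly as an update rule until the values settle. A minimal sketch on a four-state chain (the chain, the fixed "always move right" policy, and γ = 0.9 are my own illustrative choices):

```python
# Iterative policy evaluation with Bellman's equation on a tiny deterministic chain:
# states 0..3, state 3 is the goal, reward -1 per step until the goal.
GAMMA = 0.9
STATES = [0, 1, 2, 3]

def delta(s, a):                 # next state
    return min(s + 1, 3) if a == "right" else max(s - 1, 0)

def r(s, a):                     # -1 per step until the goal is reached
    return 0.0 if s == 3 else -1.0

def pi(s):                       # the fixed policy being evaluated
    return "right"

V = {s: 0.0 for s in STATES}
for sweep in range(100):         # repeat until the values stop changing
    for s in STATES:
        # V(s) = r(s, pi(s)) + gamma * V(delta(s, pi(s)))
        V[s] = r(s, pi(s)) + GAMMA * V[delta(s, pi(s))]
print(V)                         # roughly {0: -2.71, 1: -1.9, 2: -1.0, 3: 0.0}
```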
2 Learning
Normal scenario: r(s, a) and δ(s, a) are not known
V^π must be estimated from experience
Monte-Carlo Method
Start at a random s
Follow π, store the rewards and the states s_t
When the goal is reached, update the V^π(s) estimate for all visited states with the future reward we actually received
Painfully slow!
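A sketch of this procedure, assuming a small chain task, an every-visit averaging scheme, and a stochastic policy of my own choosing (none of these details are fixed by the slides):

```python
import random

GAMMA, GOAL = 1.0, 3                         # undiscounted episodic task

def step(s):                                 # environment + fixed policy pi (illustrative)
    return min(s + random.choice([0, 1]), GOAL), -1.0

returns = {s: [] for s in range(GOAL + 1)}   # observed future rewards per state
V = {s: 0.0 for s in range(GOAL + 1)}

for episode in range(2000):
    s = random.randint(0, GOAL - 1)          # start at a random s
    visited, rewards = [], []
    while s != GOAL:                         # follow pi, store rewards and states
        visited.append(s)
        s, reward = step(s)
        rewards.append(reward)
    G = 0.0                                  # goal reached: credit every visited state
    for s_t, r_t in zip(reversed(visited), reversed(rewards)):
        G = r_t + GAMMA * G                  # future reward actually received from s_t
        returns[s_t].append(G)
        V[s_t] = sum(returns[s_t]) / len(returns[s_t])
print(V)
```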
Temporal Difference Learning
[Figure: the state-value example revisited, illustrating how one state's value estimate is revised after a transition]
Temporal Difference — the difference between:
Experienced value
(immediate reward + value of next state)
Expected value
Measure of unexpected success
Two sources:
Higher immediate reward than expected
Reached better situation than expected
Expected Value
V^π(s_t)
One-step Experienced Value
r_{t+1} + γ · V^π(s_{t+1})
TD-signal — measure of surprise / disappointment
TD = r_{t+1} + γ · V^π(s_{t+1}) − V^π(s_t)
TD Learning
V^π(s_t) ← V^π(s_t) + η · TD
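The same estimation task with the TD update above; the chain environment, η = 0.1, and γ = 0.9 are illustrative assumptions:

```python
import random

GAMMA, ETA, GOAL = 0.9, 0.1, 3

def step(s):                                     # environment + fixed policy (illustrative)
    return min(s + random.choice([0, 1]), GOAL), -1.0

V = {s: 0.0 for s in range(GOAL + 1)}

for episode in range(2000):
    s = random.randint(0, GOAL - 1)
    while s != GOAL:
        s_next, reward = step(s)
        td = reward + GAMMA * V[s_next] - V[s]   # TD: experienced minus expected value
        V[s] += ETA * td                         # V(s_t) <- V(s_t) + eta * TD
        s = s_next
print(V)
```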
Learning to Act
How do we get a policy from the TD signal?
Many possibilities...
Actor–Critic Model
(Barto, Sutton, Anderson IEEE Trans. Syst. Man & Cybern. 1983)
Q-Learning
(Watkins, 1989; Watkins & Dayan Machine Learning 1992)
Actor–Critic Model
[Figure: Actor–Critic architecture; the Critic receives the state and reward and emits the TD signal, the Actor receives the state and selects the action]
Actor — Associates states with actions π(·)
Critic — Associates states with their value V (·)
TD-signal is used as a reinforcement signal to update both!
High TD (unexpected success)
Increase value of preceding state
Increase tendency to make same action again
Low TD (unexpected failure)
Decrease value of preceding state
Decrease tendency to make same action again
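A minimal tabular sketch of this scheme, assuming softmax action preferences for the actor and the same kind of toy chain task (the slides do not fix these details):

```python
import math
import random

GAMMA, ETA_V, ETA_P, GOAL = 0.9, 0.1, 0.1, 3
ACTIONS = [-1, +1]                                   # hypothetical left/right moves

V = {s: 0.0 for s in range(GOAL + 1)}                # critic: state values
pref = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}   # actor: action preferences

def act(s):                                          # softmax over the actor's preferences
    m = max(pref[(s, a)] for a in ACTIONS)           # subtract max for numerical stability
    weights = [math.exp(pref[(s, a)] - m) for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights)[0]

for episode in range(1000):
    s = random.randint(0, GOAL - 1)
    while s != GOAL:
        a = act(s)
        s_next = min(max(s + a, 0), GOAL)
        reward = -1.0                                # -1 per step until the goal
        td = reward + GAMMA * V[s_next] - V[s]       # the critic's TD signal
        V[s] += ETA_V * td                           # adjust value of the preceding state
        pref[(s, a)] += ETA_P * td                   # adjust tendency to repeat the action
        s = s_next
```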
Q-Learning
Estimate Q(s, a) instead of V(s)
Q(s, a): Expected total reward when doing a from s
π(s) = argmax_a Q(s, a)
V(s) = max_a Q(s, a)
The Q-function can also be learned using Temporal-Difference:
Q(s, a) ← Q(s, a) + η · [r + γ · max_{a′} Q(s′, a′) − Q(s, a)]
where s′ is the next state.
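A tabular sketch of this update on the same chain task (ε-greedy exploration and the constants are assumptions; ε-greedy itself is discussed in the next section):

```python
import random

GAMMA, ETA, EPSILON, GOAL = 0.9, 0.1, 0.1, 3
ACTIONS = [-1, +1]
Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}

def greedy(s):
    return max(ACTIONS, key=lambda a: Q[(s, a)])

for episode in range(2000):
    s = random.randint(0, GOAL - 1)
    while s != GOAL:
        a = random.choice(ACTIONS) if random.random() < EPSILON else greedy(s)
        s_next = min(max(s + a, 0), GOAL)
        reward = -1.0
        # Q(s,a) <- Q(s,a) + eta * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ETA * (reward + GAMMA * best_next - Q[(s, a)])
        s = s_next

print(greedy(0))   # the learned greedy policy should move right (+1), toward the goal
```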
3 Improvements
What do we do when...
The environment is not deterministic
The environment is not fully observable
There are way too many states
The states are not discrete
The agent is acting in continuous time
The Exploration–Exploitation dilemma
If an agent strictly follows a greedy policy based on the current estimate of Q, learning is not guaranteed to converge to the optimal Q∗
Simple solution:
Use a policy which has a certain probability of "making mistakes"
ε-greedy
Sometimes (with probability ε) make a random action instead of the one that seems best (greedy)
Softmax
Assign a probability to choose each action depending on how good they seem
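Both rules fit in a few lines. A sketch, assuming a tabular Q indexed by (state, action) pairs; the softmax temperature tau is an extra parameter not mentioned on the slide:

```python
import math
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def softmax_action(Q, s, actions, tau=1.0):
    """Choose each action with a probability that grows with how good it seems."""
    weights = [math.exp(Q[(s, a)] / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]

Q = {(0, "left"): -1.0, (0, "right"): 2.0}   # toy Q-table for a single state
print(epsilon_greedy(Q, 0, ["left", "right"]))
print(softmax_action(Q, 0, ["left", "right"]))
```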
Accelerated learning
Idea: TD updates can be used to improve not only the last state's value, but also states we have visited earlier.
∀s, a : Q(s, a) ← Q(s, a) + η · [r_{t+1} + γ · Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)] · e(s, a)
e(s, a) is a remaining trace (eligibility trace) encoding how long ago we were in s doing a.
Often denoted TD(λ), where λ is the time constant of the trace e
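A sketch of this trace-based update (a SARSA(λ)-style variant, since the target uses Q(s_{t+1}, a_{t+1})); the decay rule e ← γλ · e and the chain task are my own assumptions:

```python
import random

GAMMA, ETA, LAMBDA, EPSILON, GOAL = 0.9, 0.1, 0.8, 0.1, 3
ACTIONS = [-1, +1]
Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}

def choose(s):                                 # epsilon-greedy action selection
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

for episode in range(1000):
    e = {sa: 0.0 for sa in Q}                  # eligibility traces, reset each episode
    s = random.randint(0, GOAL - 1)
    a = choose(s)
    while s != GOAL:
        s_next = min(max(s + a, 0), GOAL)
        reward = -1.0
        a_next = choose(s_next)
        td = reward + GAMMA * Q[(s_next, a_next)] - Q[(s, a)]
        e[(s, a)] += 1.0                       # mark that we were just in s doing a
        for sa in Q:                           # update all pairs in proportion to their trace
            Q[sa] += ETA * td * e[sa]
            e[sa] *= GAMMA * LAMBDA            # traces fade with time
        s, a = s_next, a_next
```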