Reinforcement Learning
Russell and Norvig: ch 21
CMSC 671 – Fall 2005
Slides from Jean-Claude
Latombe and Lise Getoor
Reinforcement Learning
Supervised (inductive) learning is the simplest and
most studied type of learning
How can an agent learn behaviors when it doesn’t
have a teacher to tell it how to perform?
• The agent has a task to perform
• It takes some actions in the world
• At some later point, it gets feedback telling it how well it did on performing the task
• The agent performs the same task over and over again
This problem is called reinforcement learning:
• The agent gets positive reinforcement for tasks done well
• The agent gets negative reinforcement for tasks done poorly
Reinforcement Learning (cont.)
The goal is to get the agent to act in the
world so as to maximize its rewards
The agent has to figure out what it did that
made it get the reward/punishment
• This is known as the credit assignment problem
Reinforcement learning approaches can be
used to train computers to do many tasks
• backgammon and chess playing
• job shop scheduling
• controlling robot limbs
Reinforcement learning on the web
Nifty applets:
• for blackjack
• for robot motion
• for a pendulum controller
Formalization
Given:
• a state space S
• a set of actions a1, …, ak
• a reward value at the end of each trial (may be positive or negative)
Output:
• a mapping from states to actions
Example: ALVINN (autonomous driving agent)
• state: configuration of the car
• learn a steering action for each state
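The formalization above can be sketched as plain data: a state space, an action set, and a learned policy mapping each state to an action. The states and actions below are invented placeholders for illustration, not the driving agent's real ones.

```python
# Illustrative state space S and action set a1, ..., ak (made-up names)
states = ["lane_left", "lane_center", "lane_right"]
actions = ["steer_left", "steer_straight", "steer_right"]

# Output of learning: a mapping from states to actions (a policy)
policy = {
    "lane_left": "steer_right",
    "lane_center": "steer_straight",
    "lane_right": "steer_left",
}

print(policy["lane_center"])  # steer_straight
```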
Accessible or observable state
Repeat:
• s ← sensed state
• If s is terminal then exit
• a ← choose action (given s)
• Perform a
Reactive Agent Algorithm
Policy (Reactive/Closed-Loop Strategy)
• A policy P is a complete mapping from states to actions
[Figure: 4×3 grid world (columns 1–4, rows 1–3) with terminal states +1 and –1]
Repeat:
• s ← sensed state
• If s is terminal then exit
• a ← P(s)
• Perform a
Reactive Agent Algorithm
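The policy-driven loop above can be sketched directly. Here `sense_state`, `is_terminal`, and `perform` are assumed hooks into the environment, which the slides leave abstract; the toy walk below is made up for illustration.

```python
def reactive_agent(policy, sense_state, is_terminal, perform):
    """Repeat: sense the state, exit if terminal, else perform the policy's action."""
    while True:
        s = sense_state()       # s <- sensed state
        if is_terminal(s):      # if s is terminal then exit
            return
        perform(policy[s])      # a <- P(s); perform a

# Toy walk: states 0..3, state 3 is terminal, "right" advances the state.
state = {"s": 0}
trace = []

def perform(a):
    trace.append(a)
    state["s"] += 1

reactive_agent({0: "right", 1: "right", 2: "right"},
               sense_state=lambda: state["s"],
               is_terminal=lambda s: s == 3,
               perform=perform)
print(trace)  # ['right', 'right', 'right']
```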
Approaches
Learn the policy directly: a function mapping from states to actions
Learn utility values for states (i.e., the value function)
Value Function
The agent knows what state it is in
The agent has a number of actions it can perform in
each state.
Initially, it doesn't know the value of any of the states
If the outcome of performing an action at a state is
deterministic, then the agent can update the utility
value U() of states:
• U(oldstate) = reward + U(newstate)
The agent learns the utility values of states as it
works its way through the state space
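For a deterministic world, this update can be sketched as a backward pass over one observed trial. The states and rewards below are made up for illustration.

```python
def backup_utilities(trajectory, U):
    """Apply U(oldstate) = reward + U(newstate) backward along one trial.
    trajectory: list of (state, reward, next_state) transitions."""
    for s, r, s_next in reversed(trajectory):
        U[s] = r + U.get(s_next, 0.0)  # unknown states default to 0
    return U

# Hypothetical three-step trial that ends with a +1 reward.
trial = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, "end")]
print(backup_utilities(trial, {}))  # {'C': 1.0, 'B': 1.0, 'A': 1.0}
```

Note that without a discount, every state on the path to the reward ends up with the same utility, which motivates the discount factor introduced next.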
Exploration
The agent may occasionally choose to explore
suboptimal moves in the hopes of finding better
outcomes
• Only by visiting all the states frequently enough can we guarantee learning the true values of all the states
A discount factor is often introduced to prevent utility
values from diverging and to promote the use of
shorter (more efficient) sequences of actions to
attain rewards
The update equation using a discount factor γ is:
• U(oldstate) = reward + γ * U(newstate)
Normally, γ is set between 0 and 1
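A minimal sketch of the discounted update, using γ = 0.5 (an arbitrary illustrative value) on a made-up three-step trial; states farther from the reward now score lower, which is what favors shorter action sequences.

```python
def discounted_backup(trajectory, U, gamma=0.5):
    """Backward pass with a discount: U(oldstate) = reward + gamma * U(newstate)."""
    for s, r, s_next in reversed(trajectory):
        U[s] = r + gamma * U.get(s_next, 0.0)
    return U

# Same hypothetical trial as before: +1 reward after three steps.
trial = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, "end")]
print(discounted_backup(trial, {}))  # {'C': 1.0, 'B': 0.5, 'A': 0.25}
```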
Q-Learning
Q-learning augments value iteration by
maintaining an estimated utility value
Q(s,a) for every action at every state
The utility of a state U(s), or Q(s), is
simply the maximum Q value over all
the possible actions at that state
Learns utilities of actions (not states) → model-free learning
Q-Learning
for each state s:
    for each action a:
        Q(s,a) = 0
s = current state
do forever:
    a = select an action
    do action a
    r = reward from doing a
    t = resulting state from doing a
    Q(s,a) = (1 – α) Q(s,a) + α (r + γ Q(t))
    s = t
The learning coefficient, α, determines how quickly our estimates are updated
Normally, α is set to a small positive constant less than 1
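The algorithm above can be sketched end to end. Two details are illustrative assumptions, not part of the slides: the 5-state chain environment, and ε-greedy selection standing in for the unspecified "select an action" step. Q(t) is the max over the actions available at t, per the earlier definition of a state's utility.

```python
import random

def q_learning(n_states, actions, step, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning. step(s, a) -> (reward, next_state, done)."""
    random.seed(seed)
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}  # Q(s,a) = 0
    for _ in range(episodes):
        s = 0                                   # s = current state
        done = False
        while not done:
            if random.random() < epsilon:       # explore: random action
                a = random.choice(actions)
            else:                               # exploit: best current estimate
                a = max(actions, key=lambda act: Q[(s, act)])
            r, t, done = step(s, a)             # do action a; observe r and t
            U_t = 0.0 if done else max(Q[(t, act)] for act in actions)  # Q(t)
            # Q(s,a) = (1 - alpha) Q(s,a) + alpha (r + gamma Q(t))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * U_t)
            s = t
    return Q

# Made-up 5-state chain: "right" walks toward a +1 reward at state 4.
def step(s, a):
    t = min(s + 1, 4) if a == "right" else max(s - 1, 0)
    return (1.0, t, True) if t == 4 else (0.0, t, False)

Q = q_learning(5, ["left", "right"], step)
print(max(["left", "right"], key=lambda act: Q[(3, act)]))  # right
```

After a few hundred episodes, the greedy action at every non-terminal state is "right", i.e., the learned Q values encode the shortest path to the reward.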
Selecting an Action
Simply choose action with highest (current)
expected utility?
Problem: each action has two effects
• yields a reward (or penalty) on current sequence
• information is received and used in learning for future sequences
Trade-off: immediate good for long-term well-being
• stuck in a rut
• try a shortcut – you might get lost; you might learn a new, quicker route!
Exploration policy
Wacky approach (exploration): act randomly
in hopes of eventually exploring entire
environment
Greedy approach (exploitation): act to
maximize utility using current estimate
Reasonable balance: act more wacky (exploratory) when the agent has little idea of the environment; act more greedy when the model is close to correct
Example: n-armed bandits…
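The balance can be sketched with ε-greedy action selection on a made-up 2-armed bandit. For simplicity ε is held constant and each arm's payoff is deterministic; the arm names and payoffs are invented for illustration.

```python
import random

def epsilon_greedy_bandit(n_pulls=1000, epsilon=0.1, seed=1):
    """2-armed bandit: explore with probability epsilon, else exploit."""
    random.seed(seed)
    payoff = {"A": 0.3, "B": 0.7}          # hidden per-pull rewards (made up)
    est = {"A": 0.0, "B": 0.0}             # running reward estimates
    count = {"A": 0, "B": 0}
    for _ in range(n_pulls):
        if random.random() < epsilon:      # wacky: pick an arm at random
            arm = random.choice(["A", "B"])
        else:                              # greedy: best current estimate
            arm = max(est, key=est.get)
        r = payoff[arm]
        count[arm] += 1
        est[arm] += (r - est[arm]) / count[arm]  # incremental running average
    return max(est, key=est.get)

print(epsilon_greedy_bandit())  # B
```

Even a small constant ε is enough here: once exploration samples arm B at all, its higher estimate wins out and the greedy steps exploit it thereafter.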
RL Summary
Active area of research
Approaches from both OR and AI
There are many more sophisticated
algorithms that we have not discussed
Applicable to game-playing, robot
controllers, others