Reinforcement Learning
Dr. P. Kuppusamy
Prof / CSE
Reinforcement learning
• In Reinforcement Learning, an agent senses its environment, learns to
behave (act) in that environment by performing actions, and observes the
results (reinforcement).
• Task
- Learn how to behave successfully to achieve a goal while
interacting with an external environment.
- The goal of the agent is to learn an action policy that
maximizes the total reward it will receive from any starting
state.
• Examples
– Game playing: the player knows whether it wins or loses, but not
which move to make at each step.
Applications
• A robot cleaning a room and recharging its battery
• Robot soccer
• Investing in shares
• Modeling the economy through rational agents
• Learning how to fly a helicopter
• Scheduling planes to their destinations
Reinforcement Learning Process
• RL contains two primary components:
1. Agent (A) – the RL algorithm that learns from trial and error
2. Environment – the world in which the agent moves (interacts and takes actions)
• State (S) – Current situation returned by the environment
• Reward (R) – An immediate return from the environment that appraises the last action
• Policy (π) – The approach the agent uses to decide the next action based on the current state
• Value (V) – Expected long-term return with discount, as opposed to the short-term reward (R)
• Action-Value (Q) – Similar to Value, except that it takes an additional parameter, the current
action (A)
Figure: RL as learning from interaction.
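Below is a minimal Python sketch of this interaction cycle. The `env` and `agent` objects and their methods (`reset`, `step`, `act`, `learn`) are illustrative assumptions for the loop in the figure, not an API defined in these slides.

```python
def run_episode(env, agent):
    """One episode of the generic agent-environment interaction cycle."""
    state = env.reset()                    # environment returns the initial state S
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)          # policy pi chooses an action for the current state
        next_state, reward, done = env.step(action)    # environment returns reward R and next state
        agent.learn(state, action, reward, next_state) # agent updates its value estimates (V or Q)
        total_reward += reward
        state = next_state
    return total_reward                    # long-run return the agent tries to maximize
```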
RL Approaches
• Two approaches
– Model-based RL:
• learn the model, and use it to derive the optimal policy,
e.g. the Adaptive Dynamic Programming (ADP) approach
– Model-free RL:
• derive the optimal policy without learning the model,
e.g. the LMS and Temporal Difference (TD) approaches
• Passive learning
– The agent simply watches the world during transitions and tries to
learn the utilities of the various states
• Active learning
– The agent does not simply watch, but also acts on the environment
Example
• Immediate reward is worth more than future reward.
• Reward Maximization – The agent is trained to take the best (optimal)
action to get the maximum reward.
Reward Maximization
• Exploration – Searching the environment to capture more information about it
• Exploitation – Using the already known information to maximize the
rewards (a simple ε-greedy action-selection sketch that balances the two is shown below)
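A common way to trade off exploration and exploitation is an ε-greedy rule: explore with a small probability ε, exploit otherwise. The sketch below is illustrative and not taken from the slides.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action index given a list of Q-values for the current state."""
    if random.random() < epsilon:
        # Exploration: try a random action to gather more information
        return random.randrange(len(q_values))
    # Exploitation: use the already known information and pick the best action
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example: epsilon_greedy([0.0, 2.5, 1.0]) usually returns 1, occasionally explores.
```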
Reinforcement learning model
• Each percept (e) is enough to determine the state (the state is
accessible).
• Agent’s task: find an optimal policy, mapping states of the
environment to actions of the agent, that maximizes a long-run
measure of the reward (reinforcement).
• This can be modeled as a Markov Decision Process (MDP).
• A Markov decision process (MDP) is a mathematical
framework for modeling decision making, i.e. for mapping out a
solution in reinforcement learning.
MDP model
• MDP model <S,T,A,R>
[Figure: agent–environment interaction loop. The environment returns a state and a reward, the
agent takes actions, producing the trajectory s0, a0, r0, s1, a1, r1, s2, a2, r2, s3, ...]
• S – set of states
• A – set of actions
• Transition function: T(s,a,s’) = P(s’|s,a), the probability of
transitioning from s to s’ given action a;
T(s,a) → s’
• Reward function: r(s,a) → r, the expected reward for taking
action a in state s
R(s, a) = Σ_{s'} P(s' | s, a) · r(s, a, s') = Σ_{s'} T(s, a, s') · r(s, a, s')
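As a concrete illustration of the expected-reward formula above, R(s, a) can be computed from the transition probabilities T and per-transition rewards r. The dictionaries below hold hypothetical toy values, not data from the slides.

```python
# Toy transition model and per-transition rewards (illustrative values only)
T = {("s0", "a0"): {"s1": 0.7, "s2": 0.3}}               # T(s, a, s') = P(s' | s, a)
r = {("s0", "a0", "s1"): 5.0, ("s0", "a0", "s2"): -1.0}  # r(s, a, s')

def expected_reward(s, a):
    """R(s, a) = sum over s' of P(s' | s, a) * r(s, a, s')."""
    return sum(p * r[(s, a, s_next)] for s_next, p in T[(s, a)].items())

print(expected_reward("s0", "a0"))   # 0.7 * 5.0 + 0.3 * (-1.0) = 3.2
```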
MDP - Example I
• Consider the graph below and find the shortest path from the start node S to the
goal node G.
• Set of states: {S, T, U, V}
• Action – traversal from one state to another state
• Reward – traversing an edge yields the “edge length” in dollars
• Policy – the path taken to reach the destination, e.g. {S → T → V}
[Figure: graph over nodes S, T, U, V and G, with edge weights 14, 51, 25, 15, -5 and -22.]
Q - Learning
• Q-Learning is a value-based reinforcement learning algorithm that uses Q-
values (action values) to iteratively improve the behavior of the learning
agent.
• The goal is to maximize the Q value and thereby find the optimal action-selection policy.
• The Q-table helps to find the best action for each state and maximize the
expected reward.
• Q-Values / Action-Values: Q-values are defined for state–action pairs (one
simple way to store them in code is sketched after this slide).
• Q(s, a) denotes an estimate of the value of taking action a in state s.
• This estimate of Q(s, a) is iteratively refined using the TD-
Update rule.
• Reward: At every transition, the agent observes a reward for its action
from the environment, and then moves to another state.
• Episode: If at any point in time the agent ends up in one of the terminating
states, i.e. no further transitions are possible, the episode is complete.
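For concreteness, one simple way to hold these Q-values in code (an assumed representation, not prescribed by the slides) is a table indexed by (state, action) pairs:

```python
from collections import defaultdict

# Q-table: unvisited (state, action) pairs default to an estimate of 0
Q = defaultdict(float)

Q[(2, 3)] = 40.0      # hypothetical entry: value of taking action 3 in state 2
print(Q[(2, 3)])      # 40.0
print(Q[(0, 4)])      # 0.0 -- never updated, so still the default
```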
Q-Learning
• Initially the agent explores the environment and updates the Q-table. Once the
Q-table is ready, the agent starts to exploit the environment and take
better actions.
• It is an off-policy control algorithm, i.e. the updated (target) policy is different
from the behavior policy. It estimates the reward for future actions and
assigns a value to the new state without following a greedy behavior policy.
Temporal Difference or TD-Update:
• The estimate of Q is updated at every time step of the agent’s
interaction with the environment (the update rule is written out below).
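For reference (the rule itself is not written out on this slide), the TD-update is commonly stated with a learning rate α and discount factor γ as:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_a' Q(s', a') − Q(s, a) ]

The simplified update used later in these slides, Q(s, a) = R(s, a) + Gamma · max Q(s', ·), is the special case α = 1.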
Advantage:
• Converges to an optimal policy in both deterministic and nondeterministic
MDPs.
Disadvantage:
• Suitable only for small problems, since the Q-table must cover every state–action pair.
Understanding Q – Learning
• The building environment contains 5 rooms connected by doors.
• Each room is numbered from 0 to 4. The outside of the building is numbered 5.
• Doors from rooms 1 and 4 lead to the outside of the building (5).
• Problem: the agent can be placed in any one of the rooms (0, 1, 2, 3, 4). The agent’s
goal is to reach the outside of the building (room 5).
Understanding Q – Learning
• Represent the rooms as a graph.
• Each room number is a state, and each door is an edge.
Understanding Q – Learning
• Assign a reward value to each
door.
• Doors that lead immediately to the
target are assigned an instant reward
of 100.
• Other doors, not directly connected
to the target room, have zero
reward.
• Doors are two-way (for example, 0
leads to 4, and 4 leads back to 0),
so two edges are assigned between
each pair of connected rooms.
• Each edge carries an instant
reward value.
Understanding Q – Learning
• Consider an agent that starts from state (room) 2.
• The agent’s movement from one state to another is an action a.
• The agent traverses from state 2 to state 5 (the target):
– Initial state = current state, i.e. state 2
– Transition: State 2 → State 3
– Transition: State 3 → State {2, 1, 4}
– Transition: State 4 → State 5
Understanding Q – Learning
Understanding Q – Learning: Prepare matrix Q
• Matrix Q is the memory of the agent, in which information learned
from experience is stored.
• Each row denotes the current state of the agent.
• Each column denotes a possible action leading to the next state.
Compute the Q matrix:
Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
• Gamma is the discount factor for future rewards. Its range is 0 to 1,
i.e. 0 < Gamma < 1.
• Future rewards are less valuable than current rewards, so they must
be discounted.
• If Gamma is closer to 0, the agent tends to consider only
immediate rewards.
• If Gamma is closer to 1, the agent weights future rewards more
heavily (a small numerical illustration follows).
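As an illustration (not on the slide): with Gamma = 0.8, a reward of 100 that lies two steps ahead contributes 0.8 × 0.8 × 100 = 64 to the current estimate; with Gamma = 0.1 it contributes only 0.1 × 0.1 × 100 = 1, so the agent effectively ignores it.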
Q – Learning Algorithm
• Set the Gamma parameter.
• Set the environment rewards in matrix R.
• Initialize matrix Q to zero.
• Select a random initial (source) state.
– Set initial state s = current state.
• Select one action a among all possible actions using an exploratory policy.
– Take this action a, moving to the next state s’.
– Observe reward r.
– Get the maximum Q value of the next state over all possible
actions.
• Compute:
– Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
• Repeat the above steps until the goal state is reached, i.e. current state =
goal state. (A runnable sketch of this loop, applied to the room example, is given below.)
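The following Python sketch puts the algorithm together for the room example, using the reward matrix and Gamma = 0.8 shown on the next slides. It is a minimal illustration under stated assumptions: the exploratory policy (uniformly random over valid actions) and the number of episodes are choices of this sketch, not prescribed by the slides.

```python
import numpy as np

# Reward matrix R for the 6-room example: rows = states, columns = actions.
# -1 marks an impossible move, 0 a valid move, 100 a move into the goal (state 5).
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
GAMMA = 0.8
GOAL = 5

Q = np.zeros(R.shape)                 # initialize matrix Q as zero
rng = np.random.default_rng(0)

for episode in range(1000):
    state = int(rng.integers(0, 6))   # select a random initial state
    while state != GOAL:
        valid = np.flatnonzero(R[state] >= 0)     # actions possible from this state
        action = int(rng.choice(valid))           # exploratory policy: pick one at random
        next_state = action                       # in this example the action is the room moved to
        # Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
        Q[state, action] = R[state, action] + GAMMA * Q[next_state].max()
        state = next_state

print(np.round(100 * Q / Q.max()))    # normalized Q-table after training
```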
Example: Q – Learning
[Figure: matrix indexed by State 0–5 (rows) and Action 0–5 (columns).]
Example: Q – Learning
[Figure: state transition diagram for this episode.]

R (rows = state 0–5, columns = action 0–5):

        0    1    2    3    4    5
   0 [ -1   -1   -1   -1    0   -1 ]
   1 [ -1   -1   -1    0   -1  100 ]
   2 [ -1   -1   -1    0   -1   -1 ]
   3 [ -1    0    0   -1    0   -1 ]
   4 [  0   -1   -1    0   -1  100 ]
   5 [ -1    0   -1   -1    0  100 ]

Q (all zeros except the newly updated entry Q(1,5) = 100, shown in bold on the slide):

        0    1    2    3    4    5
   0 [  0    0    0    0    0    0 ]
   1 [  0    0    0    0    0  100 ]
   2 [  0    0    0    0    0    0 ]
   3 [  0    0    0    0    0    0 ]
   4 [  0    0    0    0    0    0 ]
   5 [  0    0    0    0    0    0 ]

• Update the Matrix Q.
• For the next episode, randomly choose state 3, which
becomes the current state.
• State 3 has 3 possible next states: 1, 2 or 4.
• Let’s choose state 1.
• Compute the maximum Q value of the next state over all
possible actions.
Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
• Q(3,1) = R(3,1) + 0.8 * max[Q(1,3), Q(1,5)]
= 0 + 0.8 * max[0, 100] = 0 + 80 = 80
• Update the Matrix Q.
[Figure: state transition diagram for this episode.]

R is unchanged. Q after this update, with Q(1,5) = 100 and the new entry Q(3,1) = 80 (shown in
bold on the slide):

        0    1    2    3    4    5
   0 [  0    0    0    0    0    0 ]
   1 [  0    0    0    0    0  100 ]
   2 [  0    0    0    0    0    0 ]
   3 [  0   80    0    0    0    0 ]
   4 [  0    0    0    0    0    0 ]
   5 [  0    0    0    0    0    0 ]
• For the next episode, state 1 becomes the current state.
• Repeat the inner loop, because 1 is not the target state.
• From state 1, the agent can go to either 3 or 5.
• Let’s choose state 5.
• Compute the maximum Q value of the next state over all
possible actions.
• Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
• Q(1,5) = R(1,5) + 0.8 * max[Q(5,1), Q(5,4), Q(5,5)]
= 100 + 0.8 * max[0, 0, 0] = 100 + 0 = 100
• Q remains the same, because Q(1,5) = 100 is already stored. State 5 is the goal, so stop the process.
[The R and Q matrices are unchanged from the previous episode: Q(1,5) = 100, Q(3,1) = 80, all
other entries 0.]
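Once a Q-table like this has been filled in for every state (for example by the training sketch given after the algorithm slide), the learned policy is read off greedily. The helper below is an illustrative assumption, expecting Q as a NumPy array:

```python
def greedy_path(Q, start, goal, max_steps=10):
    """Follow the highest-valued action from each state until the goal is reached."""
    path = [start]
    state = start
    while state != goal and len(path) <= max_steps:
        state = int(Q[state].argmax())   # greedy action = column with the largest Q value
        path.append(state)
    return path

# With a fully trained Q-table for this example, greedy_path(Q, 2, 5)
# returns a shortest route such as [2, 3, 1, 5] or [2, 3, 4, 5].
```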
References
• Tom Markiewicz & Josh Zheng, Getting Started with Artificial Intelligence, O’Reilly Media, 2017.
• Stuart J. Russell and Peter Norvig, Artificial Intelligence: A Modern Approach.
• Richard Szeliski, Computer Vision: Algorithms and Applications, Springer, 2010.