Reinforcement Learning
Dr.S.Nagaraju
Overview
1 About Reinforcement Learning
2 The Reinforcement Learning Problem
Rewards
Environment
State
3 Inside an RL Agent
4 Problems within Reinforcement Learning & MDP
Many Facets of Reinforcement Learning
Figure: Many Facets of Reinforcement Learning.
Branches of Machine Learning
Figure: Branches of Machine Learning.
What is Reinforcement Learning?
Reinforcement learning is the science of sequential decision making: it deals with agents that must sense and act upon their environment.
An agent learns to behave in an environment by performing actions and observing their results.
For each good action the agent receives positive feedback, and for each bad action it receives negative feedback.
In RL the agent learns automatically from this feedback, without any labelled data, unlike supervised learning.
Characteristics of Reinforcement Learning
What makes reinforcement learning different from other machine
learning paradigms?
No supervisor, only a reward signal
Feedback is delayed, not instantaneous
Time really matters (sequential, non-i.i.d. data)
Agent’s actions affect the subsequent data it receives
The Reinforcement Learning Components
Rewards
A reward R_t is a scalar feedback signal
It indicates how well the agent is doing at time-step t
The agent’s job is to maximise cumulative reward
Reinforcement learning is based on the reward hypothesis
Definition (Reward Hypothesis)
All goals can be described by the maximisation of expected
cumulative reward
Examples of Rewards
Fly stunt manoeuvres in a helicopter
+ve reward for following desired trajectory
-ve reward for crashing
Defeat the world champion
+/-ve reward for winning/losing a game
Examples of Rewards
Control a power station
+ve reward for producing power
-ve reward for exceeding safety thresholds
Make a humanoid robot walk
+ve reward for forward motion
-ve reward for falling over
Agent and Environment
At each step t the agent:
Executes action A_t
Receives observation O_t
Receives scalar reward R_t
The environment:
Receives action A_t
Emits observation O_t
Emits scalar reward R_t
t increments at each environment step
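This loop translates directly into code. Below is a minimal sketch of the interaction described above, assuming a toy environment with a Gym-style step() method; the class, action set, and reward values are illustrative, not part of the slides.

```python
import random

class ToyEnvironment:
    """Illustrative environment: receives A_t, emits O_t and R_t."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action                # internal environment state
        observation = self.state            # emit observation O_t
        reward = -1.0                       # emit scalar reward R_t (assumed: -1 per step)
        return observation, reward

env = ToyEnvironment()
for t in range(5):                          # t increments at each environment step
    action = random.choice([-1, +1])        # agent executes action A_t
    observation, reward = env.step(action)  # agent receives O_t and R_t
    print(t, action, observation, reward)
```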
History and State
The history is the sequence of actions, observations, and rewards:
H_t = A_1, O_1, R_1, …, A_t, O_t, R_t
What happens next depends on the history:
The agent selects actions
The environment selects observations/rewards
State is the concise information used to determine what happens next
Formally, state is a function of the history:
S_t = f(H_t)
Environment State
The environment state S^e_t is the environment's internal data representation,
i.e. whatever data the environment uses to pick the next observation/reward
The environment can be real-world or simulated
Agent State
The agent state S^a_t is the agent's internal representation,
i.e. whatever information the agent uses to pick the next action
i.e. it is the information used by reinforcement learning algorithms
It can be a function of the history:
S^a_t = f(H_t)
Information State
An information state (a.k.a. Markov state) contains all useful information from the history.
Definition
A state S_t is Markov if and only if
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, …, S_t]
"The future is independent of the past given the present"
Once the state is known, the history may be thrown away
i.e. the state is a sufficient statistic of the future
The environment state S^e_t is Markov
Fully Observable Environments
Full observability: the agent directly observes the environment state:
S_t = S^a_t = S^e_t
Agent state = environment state = information state
Formally, this is a Markov decision process (MDP)
Partially Observable Environments
Partial observability: agent indirectly observes environment:
A robot with camera vision isn’t told its absolute location
Now agent state ≠ environment state; formally this is a
partially observable Markov decision process (POMDP)
Major Components of the RL Agent
An RL agent may include one or more of these components:
Policy: the agent's behaviour (which action to take in each state)
Value function: prediction of future reward
Model: agent’s view of the environment
Policy
A policy is the agent’s behaviour
It is a map from state to action
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
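As a sketch, the two policy types map naturally onto functions; the states, actions, and probabilities below are assumptions for illustration only.

```python
import random

ACTIONS = ["N", "E", "S", "W"]   # illustrative action set (as in the maze example)

def deterministic_policy(state):
    # a = pi(s): a fixed state -> action mapping
    table = {"s0": "N", "s1": "E"}  # assumed mapping
    return table[state]

def stochastic_policy(state):
    # pi(a|s) = P[A_t = a | S_t = s]: a distribution over actions
    probs = [0.7, 0.1, 0.1, 0.1]    # assumed probabilities for illustration
    return random.choices(ACTIONS, weights=probs)[0]
```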
Value Function
A value function is a prediction of future reward
It is used to evaluate the goodness/badness of states and/or actions,
and therefore to select between actions, via the expected sum of discounted rewards:
v_π(s) = E_π[R_{t+1} + γR_{t+2} + γ²R_{t+3} + … | S_t = s]
Model
Agent’s view of the environment
P predicts the next state (transition model)
R predicts the next (immediate) reward (reward model)
P^a_{ss′} = P[S_{t+1} = s′ | S_t = s, A_t = a]
R^a_s = E[R_{t+1} | S_t = s, A_t = a]
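A tabular sketch of these two model components, with made-up states, actions, and numbers:

```python
# Transition model: P[(s, a)] is a distribution over successor states s',
# i.e. P_ss'^a = P[S_{t+1} = s' | S_t = s, A_t = a]. Entries are illustrative.
P = {
    ("s0", "a0"): {"s0": 0.1, "s1": 0.9},
    ("s1", "a0"): {"s1": 1.0},
}

# Reward model: R[(s, a)] is the expected immediate reward,
# i.e. R_s^a = E[R_{t+1} | S_t = s, A_t = a].
R = {
    ("s0", "a0"): -1.0,
    ("s1", "a0"): 0.0,
}
```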
Maze Example
Rewards: -1 per time-step
Actions: N, E, S, W
States: Agent’s location
Maze Example Policy
Arrows represent policy π(s) for each state s
Maze Example: Value Function
Numbers represent the value v_π(s) of each state s (the negative of the number
of steps the agent is from the goal, since each step costs −1)
Maze Example: Model
Grid layout represents the transition model P^a_{ss′}
Numbers represent the immediate reward R^a_s for each state s (the same for all actions a)
Categorizing RL agents (1)
Value Based
No Policy (Implicit)
Value Function
Policy Based
Policy
No Value Function
Actor Critic
Policy
Value Function
Categorizing RL agents (2)
Model Free
Policy and/or Value Function
No Model
Model Based
Policy and/or Value Function
Model
Reinforcement Learning Agent Taxonomy
Figure: Reinforcement Learning Agent Taxonomy.
Learning and Planning
Two fundamental problems in sequential decision making
Reinforcement Learning:
The environment model is initially unknown
The agent interacts with the environment
The agent improves its policy
Planning:
A model of the environment is known
The agent performs computations with its model (without any
interaction)
The agent improves its policy
a.k.a. deliberation, reasoning, introspection, pondering,
thought, search
Exploration and Exploitation
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy
From its experiences of the environment
Without losing too much reward along the way
Exploration finds more information about the environment
Exploitation exploits known information to maximise reward
It is usually important to explore as well as exploit
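One standard way to balance the two (not named on the slide, but the most common baseline) is ε-greedy action selection, sketched here:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping action -> estimated value.
    With probability epsilon explore (random action),
    otherwise exploit (highest-valued action)."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit
```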
Exploration and Exploitation Examples
Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Prediction and Control
Prediction: evaluate the future
Given a policy
Control: optimise the future
Find the best policy
Markov Decision Process (MDP)
Overview
1 Introduction
2 Markov Process
3 Markov Reward Process
4 Markov Decision Process
Introduction to MDP
A Markov decision process is a mathematical framework describing the environment
for reinforcement learning, in which the environment is fully observable
Almost all RL problems can be formalised as MDPs
Partially observable problems can be converted into MDPs
Markov Property
"The future is independent of the past given the present"
Definition
A state S_t is Markov if and only if
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, …, S_t]
The state captures all relevant information from the history
Once the state is known, the history may be thrown away
i.e. the state is a sufficient statistic of the future
State Transition Matrix
For a Markov state s and successor state s′, the state transition probability is defined by
P_{ss′} = P[S_{t+1} = s′ | S_t = s]
The state transition matrix P gives the transition probabilities from all states s to all successor states s′:

    ⎡ P_11 … P_1n ⎤
P = ⎢  ⋮   ⋱   ⋮  ⎥
    ⎣ P_n1 … P_nn ⎦

where each row of the matrix sums to 1.
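In code, P is just a row-stochastic matrix; a tiny sketch with assumed numbers:

```python
import numpy as np

# P[s, s'] = P[S_{t+1} = s' | S_t = s]; values are illustrative.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

assert np.allclose(P.sum(axis=1), 1.0)  # each row sums to 1
```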
Markov Process
A Markov process is a memoryless random process, i.e. a sequence of states S_1, S_2, … with the Markov property.
Definition
A Markov Process (or Markov Chain) is a tuple ⟨S, P⟩
S is a (finite) set of states
P is a state transition probability matrix, P_{ss′} = P[S_{t+1} = s′ | S_t = s]
Example: Student Markov Chain/Process
Example: Student Markov Process Episodes
Sample episodes for the Student Markov Chain starting from S_1 = C1:
S_1, S_2, …, S_T
C1 C2 C3 Pass Sleep
C1 FB FB C1 C2 Sleep
C1 C2 C3 Pub C2 C3 Pass Sleep
C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
Example: Student Markov Chain Transition Matrix
P =
         C1    C2    C3   Pass   Pub    FB   Sleep
C1             0.5                      0.5
C2                   0.8                       0.2
C3                         0.6   0.4
Pass                                           1.0
Pub     0.2   0.4   0.4
FB      0.1                             0.9
Sleep                                          1.0

(blank entries are zero; each row sums to 1)
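The sample episodes on the previous slide can be reproduced by rolling the chain forward. A sketch using the matrix above, treating Sleep as the terminal state:

```python
import numpy as np

states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
P = np.array([
    #  C1   C2   C3  Pass  Pub   FB  Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],  # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],  # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],  # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],  # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Sleep (terminal)
])

def sample_episode(start="C1", rng=np.random.default_rng(0)):
    # Note: the default rng is shared across calls; fine for a sketch.
    s = states.index(start)
    episode = [states[s]]
    while states[s] != "Sleep":
        s = rng.choice(len(states), p=P[s])
        episode.append(states[s])
    return episode

print(sample_episode())  # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']
```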
Markov Reward Process
A Markov reward process is a Markov chain with rewards.
Definition
A Markov Reward Process is a tuple ⟨S, P, R, γ⟩
S is a (finite) set of states
P is a state transition probability matrix, P_{ss′} = P[S_{t+1} = s′ | S_t = s]
R is a reward function, R_s = E[R_{t+1} | S_t = s]
γ is a discount factor, γ ∈ [0, 1]
Example: Student MRP
Returns
Definition
The return G_t is the total discounted reward from time-step t:
G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}
The discount γ ∈ [0, 1] is the present value of future rewards
The value of receiving reward R after k + 1 time-steps is γ^k R
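The definition translates directly into code. A minimal sketch, where rewards is the sequence R_{t+1}, R_{t+2}, …:

```python
def discounted_return(rewards, gamma):
    # G_t = sum over k of gamma^k * R_{t+k+1}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# e.g. the episode C1 C2 C3 Pass Sleep from the later returns slide,
# with rewards -2, -2, -2, +10, 0 and gamma = 1/2:
print(discounted_return([-2, -2, -2, 10, 0], gamma=0.5))  # -2.25
```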
Why discount?
Most Markov reward and decision processes are discounted. Why?
Mathematically convenient to use
Avoids infinite returns in cyclic Markov processes
If the reward is financial, immediate rewards may earn more
interest than delayed rewards
Human behaviour shows preference for immediate reward
Value Function
The value function v(s) gives the long-term value of state s
Definition
The state value function v(s) of an MRP is the expected return starting from state s:
v(s) = E[G_t | S_t = s]
Example: Student MRP Returns
Sample returns for the Student MRP, starting from S_1 = C1 with γ = 1/2:
G_1 = R_2 + γR_3 + … + γ^{T−2} R_T

C1 C2 C3 Pass Sleep:
G_1 = −2 − 2×(1/2) − 2×(1/4) + 10×(1/8) = −2.25
C1 FB FB C1 C2 Sleep:
G_1 = −2 − 1×(1/2) − 1×(1/4) − 2×(1/8) − 2×(1/16) = −3.125
C1 C2 C3 Pub C2 C3 Pass Sleep:
G_1 = −2 − 2×(1/2) − 2×(1/4) + 1×(1/8) − 2×(1/16) + ⋯ = −3.41
C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep:
G_1 = −2 − 1×(1/2) − 1×(1/4) − 2×(1/8) − 2×(1/16) + ⋯ = −3.20
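These figures are easy to check numerically; a sketch using the reward values implied above (C1/C2/C3 = −2, Pub = +1, Pass = +10, FB = −1, Sleep = 0):

```python
def g1(rewards, gamma=0.5):
    # G_1 for a reward sequence R_2, R_3, ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(g1([-2, -2, -2, 10]))                # -2.25
print(g1([-2, -1, -1, -2, -2]))            # -3.125
print(g1([-2, -2, -2, 1, -2, -2, 10]))     # -3.40625, i.e. ~ -3.41
print(g1([-2, -1, -1, -2, -2, -2, 1, -2,
          -1, -1, -1, -2, -2, -2, 1, -2])) # ~ -3.20
```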
Example: State-Value Function for Student MRP(1)
Example: State-Value Function for Student MRP(2)
Example: State-Value Function for Student MRP(3)
Bellman Equation for MRPs
The value function can be decomposed into two parts:
the immediate reward R_{t+1}
the discounted value of the successor state, γv(S_{t+1})
v(s) = E[G_t | S_t = s]
     = E[R_{t+1} + γR_{t+2} + γ²R_{t+3} + … | S_t = s]
     = E[R_{t+1} + γ(R_{t+2} + γR_{t+3} + …) | S_t = s]
     = E[R_{t+1} + γv(S_{t+1}) | S_t = s]
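This decomposition suggests an iterative way to compute v: start from zero and repeatedly apply the backup v ← R + γPv, which converges for γ < 1. A minimal sketch, vectorised over all states:

```python
import numpy as np

def evaluate_mrp(R, P, gamma, sweeps=200):
    """R: expected immediate reward per state, shape (n,).
    P: state transition matrix, shape (n, n).
    Repeatedly applies the Bellman backup above."""
    v = np.zeros(len(R))
    for _ in range(sweeps):
        v = R + gamma * P @ v
    return v
```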
Example: Bellman Equation for Student MRP
Bellman Equation in Matrix Form
The Bellman equation can be expressed concisely using matrices,
v = R + γPv
where v is a column vector with one entry per state:

⎡ v(1) ⎤   ⎡ R_1 ⎤       ⎡ P_11 … P_1n ⎤ ⎡ v(1) ⎤
⎢  ⋮   ⎥ = ⎢  ⋮  ⎥ + γ ⎢  ⋮   ⋱   ⋮  ⎥ ⎢  ⋮   ⎥
⎣ v(n) ⎦   ⎣ R_n ⎦       ⎣ P_n1 … P_nn ⎦ ⎣ v(n) ⎦
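Because the equation is linear, it can also be solved directly: v = (I − γP)⁻¹ R. A tiny sketch with assumed numbers, using np.linalg.solve rather than an explicit matrix inverse:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.0, 1.0]])     # state 1 is terminal (absorbing)
R = np.array([-1.0, 0.0])      # expected immediate reward per state
gamma = 0.9

v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)  # v satisfies v = R + gamma * P @ v
```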
Markov Decision Process
A Markov decision process (MDP) is a Markov reward process with
decisions. It is an environment in which all states are Markov.
Definition
A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩
S is a (finite) set of states
A is a finite set of actions
P is a state transition probability matrix, P^a_{ss′} = P[S_{t+1} = s′ | S_t = s, A_t = a]
R is a reward function, R^a_s = E[R_{t+1} | S_t = s, A_t = a]
γ is a discount factor, γ ∈ [0, 1]
Example: Student MDP
Value Function
Definition
The state-value function v_π(s) of an MDP is the expected return starting from state s, and then following policy π:
v_π(s) = E_π[G_t | S_t = s]
Definition
The action-value function q_π(s, a) is the expected return starting from state s, taking action a, and then following policy π:
q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
Example: State-Value Function for Student MDP
Bellman Expectation Equation
The state-value function can again be decomposed into immediate reward plus discounted value of the successor state,
v_π(s) = E_π[R_{t+1} + γv_π(S_{t+1}) | S_t = s]
The action-value function can similarly be decomposed,
q_π(s, a) = E_π[R_{t+1} + γq_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
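The two decompositions are linked: given a model, q_π can be computed from v_π as q_π(s, a) = R^a_s + γ Σ_{s′} P^a_{ss′} v_π(s′) — a standard relation, though not printed on this slide. A vectorised sketch:

```python
import numpy as np

def q_from_v(R, P, v, gamma):
    """R: rewards, shape (A, S); P: transitions, shape (A, S, S);
    v: state values, shape (S,). Returns q, shape (A, S)."""
    return R + gamma * P @ v   # (A, S, S) @ (S,) -> (A, S)
```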
Example: Bellman Expectation Equation for Student MDP
Conclusion
Reinforcement Learning is one of the most interesting and useful parts of Machine Learning.
In RL, the agent explores the environment and learns without any human intervention; it is one of the core learning paradigms used in Artificial Intelligence.
A practical difficulty with RL is that factors such as delayed feedback can considerably slow down learning.