2. Overview
1 About Reinforcement Learning
2 The Reinforcement Learning Problem
Rewards
Environment
State
3 Inside an RL Agent
4 Problems within Reinforcement Learning & MDPs
3. About Reinforcement Learning
Many Facets of Reinforcement Learning
Figure: Many Facets of Reinforcement Learning.
4. Branches of Machine Learning
Figure: Branches of Machine Learning.
5. What is Reinforcement Learning?
Reinforcement learning is the science of sequential decision making: it deals with agents that must sense and act upon their environment.
An agent learns to behave in an environment by performing actions and observing the results of previous actions.
For each good action the agent receives positive feedback, and for each bad action it receives negative feedback.
In RL, the agent learns automatically from this feedback, without any labelled data, unlike supervised learning.
6. Characteristics of Reinforcement Learning
What makes reinforcement learning different from other machine learning paradigms?
No supervisor, only a reward signal
Feedback is delayed, not instantaneous
Time really matters (sequential, non-i.i.d. data)
Agent's actions affect the subsequent data it receives
7. Reinforcement Learning Components
Rewards
A reward Rt is a scalar feedback signal
It indicates how well the agent is doing at time-step t
The agent's job is to maximise cumulative reward
Reinforcement learning is based on the reward hypothesis
Definition (Reward Hypothesis)
All goals can be described by the maximisation of expected cumulative reward
8. Examples of Rewards
Fly stunt manoeuvres in a helicopter
+ve reward for following the desired trajectory
−ve reward for crashing
Defeat the world champion
+ve/−ve reward for winning/losing a game
9. Examples of Rewards (continued)
Control a power station
+ve reward for producing power
−ve reward for exceeding safety thresholds
Make a humanoid robot walk
+ve reward for forward motion
−ve reward for falling over
10. Agent and Environment
At each step t the agent:
Executes action At
Receives observation Ot
Receives scalar reward Rt
The environment:
Receives action At
Emits observation Ot
Emits scalar reward Rt
t increments at each environment step
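The interaction loop above can be sketched in a few lines of Python. `RandomWalkEnv` is a hypothetical toy environment invented here for illustration (not part of any library): the agent executes At, and the environment emits Ot and Rt.

```python
import random

class RandomWalkEnv:
    """Hypothetical toy environment: states 0..4, start in the middle.
    Reaching state 4 emits reward +1; states 0 and 4 are terminal."""
    def __init__(self):
        self.state = 2

    def step(self, action):
        # The environment receives A_t, then emits O_t and R_t
        self.state += action                     # action is -1 (left) or +1 (right)
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state in (0, 4)
        return self.state, reward, done          # observation, reward, terminal flag

env = RandomWalkEnv()
total_reward, done = 0.0, False
while not done:
    action = random.choice([-1, +1])             # the agent executes A_t
    obs, reward, done = env.step(action)         # and receives O_t and R_t
    total_reward += reward
```

The same step-based loop shape (act, observe, receive reward, repeat until terminal) underlies most RL software interfaces.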
11. History and State
The history is the sequence of actions, observations, and rewards:
Ht = A1, O1, R1, . . . , At, Ot, Rt
What happens next depends on the history:
The agent selects actions
The environment selects observations/rewards
State is the concise information used to determine what happens next
Formally, state is a function of the history:
St = f(Ht)
12. Environment State
The environment state Sᵉt is the environment's internal data representation
i.e. whatever data the environment uses to pick the next observation/reward
The environment can be real-world or simulated
13. Agent State
The agent state Sᵃt is the agent's internal representation
i.e. whatever information the agent uses to pick the next action
i.e. it is the information used by reinforcement learning algorithms
It can be a function of history:
Sᵃt = f(Ht)
14. Information State
An information state (a.k.a. Markov state) contains all useful information from the history.
Definition
A state St is Markov if and only if
P[St+1 | St] = P[St+1 | S1, . . . , St]
"The future is independent of the past given the present"
Once the state is known, the history may be thrown away
i.e. the state is a sufficient statistic of the future
The environment state Sᵉt is Markov
15. Fully Observable Environments
Full observability: the agent directly observes the environment state:
Ot = Sᵃt = Sᵉt
Agent state = environment state = information state
Formally, this is a Markov decision process (MDP)
16. Partially Observable Environments
Partial observability: the agent indirectly observes the environment:
A robot with camera vision isn't told its absolute location
Now agent state ≠ environment state; formally this is a partially observable Markov decision process (POMDP)
17. Major Components of the RL Agent
An RL agent may include one or more of these components:
Policy: the agent's behaviour (which action to take in each state)
Value function: prediction of future reward
Model: the agent's view of the environment
18. Policy
A policy is the agent's behaviour
It is a map from state to action
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[At = a | St = s]
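The two kinds of policy can be sketched as plain functions; the state threshold and the action names ("east"/"west") are made-up illustrations, not part of any standard API.

```python
import random

# Deterministic policy: a = pi(s), a direct map from state to action
def pi_det(s):
    return "east" if s < 3 else "west"

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s],
# here represented by sampling an action from the conditional distribution
def pi_stoch(s):
    probs = {"east": 0.8, "west": 0.2} if s < 3 else {"east": 0.2, "west": 0.8}
    actions = list(probs)
    return random.choices(actions, weights=[probs[a] for a in actions])[0]
```

A deterministic policy always returns the same action for a given state; a stochastic policy defines a probability over actions and is what exploration strategies typically build on.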
19. Value Function
A value function is a prediction of future reward
Used to evaluate the goodness/badness of states and/or actions
And therefore to select between actions
vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + . . . | St = s]
20. Model
A model is the agent's view of the environment
P predicts the next state (transition model):
Pᵃss′ = P[St+1 = s′ | St = s, At = a]
R predicts the next (immediate) reward (reward model):
Rᵃs = E[Rt+1 | St = s, At = a]
21. Maze Example
Rewards: −1 per time-step
Actions: N, E, S, W
States: agent's location
22. Maze Example: Policy
Arrows represent policy π(s) for each state s
23. Maze Example: Value Function
Numbers represent value vπ(s) of each state s (the number of steps the agent is away from the goal)
24. Maze Example: Model
Grid layout represents the transition model Pᵃss′
Numbers represent the immediate reward Rᵃs for each state s (same for all a)
25. Categorizing RL agents (1)
Value Based
No Policy (implicit)
Value Function
Policy Based
Policy
No Value Function
Actor Critic
Policy
Value Function
26. Categorizing RL agents (2)
Model Free
Policy and/or Value Function
No Model
Model Based
Policy and/or Value Function
Model
27. Reinforcement Learning Agent Taxonomy
Figure: Reinforcement Learning Agent Taxonomy.
28. Learning and Planning
Two fundamental problems in sequential decision making:
Reinforcement Learning:
The environment model is initially unknown
The agent interacts with the environment
The agent improves its policy
Planning:
A model of the environment is known
The agent performs computations with its model (without any interaction)
The agent improves its policy
a.k.a. deliberation, reasoning, introspection, pondering, thought, search
29. Exploration and Exploitation
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy
From its experiences of the environment
Without losing too much reward along the way
Exploration finds more information about the environment
Exploitation exploits known information to maximise reward
It is usually important to explore as well as exploit
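One common way to balance the two is ε-greedy action selection, sketched below; the function name and the action-value list are illustrative assumptions, not part of the lecture material.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (uniform random action);
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

q = [0.2, 0.9, 0.1]                  # hypothetical action-value estimates
a = epsilon_greedy(q, epsilon=0.0)   # epsilon = 0 means pure exploitation
```

Setting ε = 0 gives pure exploitation (always the greedy action); ε = 1 gives pure exploration (uniformly random actions); values in between trade one off against the other.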
30. Exploration and Exploitation Examples
Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
31. Prediction and Control
Prediction: evaluate the future
Given a policy
Control: optimise the future
Find the best policy
34. Introduction to MDP
(Outline: Markov Process, Markov Reward Process, Markov Decision Process)
A Markov decision process is a mathematical framework for describing the environment in reinforcement learning
Where the environment is fully observable
Almost all RL problems can be formalised as MDPs
Partially observable problems can be converted into MDPs
35. Markov Property
"The future is independent of the past given the present"
Definition
A state St is Markov if and only if
P[St+1 | St] = P[St+1 | S1, . . . , St]
The state captures all relevant information from the history
Once the state is known, the history may be thrown away
i.e. the state is a sufficient statistic of the future
36. State Transition Matrix
For a Markov state s and successor state s′, the state transition probability is defined by
Pss′ = P[St+1 = s′ | St = s]
The state transition matrix P defines the transition probabilities from all states s to all successor states s′,

        ⎡ P11 · · · P1n ⎤
    P = ⎢  ·         ·  ⎥
        ⎣ Pn1 · · · Pnn ⎦

where each row of the matrix sums to 1.
37. Markov Process
A Markov process is a memoryless random process, i.e. a sequence of states S1, S2, . . . with the Markov property.
Definition
A Markov Process (or Markov Chain) is a tuple ⟨S, P⟩
S is a (finite) set of states
P is a state transition probability matrix, Pss′ = P[St+1 = s′ | St = s]
38. Example: Student Markov Chain/Process
39. Example: Student Markov Process Episodes
Sample episodes for the Student Markov Chain starting from S1 = C1:
S1, S2, . . . , ST
C1 C2 C3 Pass Sleep
C1 FB FB C1 C2 Sleep
C1 C2 C3 Pub C2 C3 Pass Sleep
C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
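Episodes like these can be sampled in a few lines of Python. The transition probabilities below are transcribed from the usual Student Markov chain figure, so treat the exact numbers as assumptions; Sleep is the terminal state.

```python
import random

# Transition probabilities (each row sums to 1); Sleep is terminal.
P = {
    "C1":   {"C2": 0.5, "FB": 0.5},
    "C2":   {"C3": 0.8, "Sleep": 0.2},
    "C3":   {"Pass": 0.6, "Pub": 0.4},
    "Pass": {"Sleep": 1.0},
    "Pub":  {"C1": 0.2, "C2": 0.4, "C3": 0.4},
    "FB":   {"FB": 0.9, "C1": 0.1},
}

def sample_episode(start="C1"):
    """Sample S1, S2, ..., ST by following P until the terminal state."""
    state, episode = start, [start]
    while state != "Sleep":
        successors = list(P[state])
        state = random.choices(successors,
                               weights=[P[state][s] for s in successors])[0]
        episode.append(state)
    return episode
```

Each call returns one episode; running it repeatedly produces traces like the samples listed above, with their relative frequencies governed by P.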
41. Markov Reward Process
A Markov reward process is a Markov chain with rewards.
Definition
A Markov Reward Process is a tuple ⟨S, P, R, γ⟩
S is a (finite) set of states
P is a state transition probability matrix, Pss′ = P[St+1 = s′ | St = s]
R is a reward function, Rs = E[Rt+1 | St = s]
γ is a discount factor, γ ∈ [0, 1]
43. Returns
Definition
The return Gt is the total discounted reward from time-step t:
Gt = Rt+1 + γRt+2 + γ²Rt+3 + . . . = Σ (k=0 to ∞) γᵏ Rt+k+1
The discount γ ∈ [0, 1] is the present value of future rewards
The value of receiving reward R after k + 1 time-steps is γᵏR
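The definition translates directly into code. Given a reward sequence Rt+1, Rt+2, . . . (here just a hypothetical list), the return Gt is:

```python
def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma**k * R_{t+k+1},
    where rewards[k] holds R_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A reward of 1 received after k + 1 = 3 steps is worth gamma**2 * 1
g = discounted_return([0.0, 0.0, 1.0], gamma=0.5)   # 0.5**2 = 0.25
```

With gamma = 1 the return is the plain sum of rewards; as gamma shrinks toward 0, rewards further in the future count for less.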
44. Why discount?
Most Markov reward and decision processes are discounted. Why?
Mathematically convenient to use
Avoids infinite returns in cyclic Markov processes
If the reward is financial, immediate rewards may earn more interest than delayed rewards
Human behaviour shows preference for immediate reward
45. Value Function
The value function v(s) gives the long-term value of state s
Definition
The state value function v(s) of an MRP is the expected return starting from state s
v(s) = E[Gt | St = s]
47. Example: State-Value Function for Student MRP (1)
48. Example: State-Value Function for Student MRP (2)
49. Example: State-Value Function for Student MRP (3)
50. Bellman Equation for MRPs
The value function can be decomposed into two parts:
immediate reward Rt+1
discounted value of successor state γv(St+1)
v(s) = E[Gt | St = s]
     = E[Rt+1 + γRt+2 + γ²Rt+3 + . . . | St = s]
     = E[Rt+1 + γ(Rt+2 + γRt+3 + . . .) | St = s]
     = E[Rt+1 + γv(St+1) | St = s]
51. Example: Bellman Equation for Student MRP
52. Bellman Equation in Matrix Form
The Bellman equation can be expressed concisely using matrices,
v = R + γPv
where v is a column vector with one entry per state:

⎡ v(1) ⎤   ⎡ R1 ⎤     ⎡ P11 · · · P1n ⎤ ⎡ v(1) ⎤
⎢  ·   ⎥ = ⎢  · ⎥ + γ ⎢  ·         ·  ⎥ ⎢  ·   ⎥
⎣ v(n) ⎦   ⎣ Rn ⎦     ⎣ Pn1 · · · Pnn ⎦ ⎣ v(n) ⎦
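Because v appears on both sides, v = R + γPv can be solved directly as v = (I − γP)⁻¹R, or by fixed-point iteration as sketched below on a small 2-state MRP whose numbers are invented purely for illustration:

```python
R = [1.0, 2.0]            # reward vector, one entry per state (assumed values)
P = [[0.5, 0.5],          # transition matrix; each row sums to 1
     [0.0, 1.0]]          # state 1 is absorbing
gamma = 0.5

# Iterate v <- R + gamma * P v; the map is a contraction for gamma < 1,
# so it converges to the unique fixed point of the Bellman equation.
v = [0.0, 0.0]
for _ in range(200):
    v = [R[s] + gamma * sum(P[s][t] * v[t] for t in range(len(R)))
         for s in range(len(R))]
```

For this toy MRP the fixed point can be checked by hand: v(2) = 2 + 0.5·v(2) gives v(2) = 4, and substituting back gives v(1) = 8/3.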
53. Markov Decision Process
A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
Definition
A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩
S is a (finite) set of states
A is a finite set of actions
P is a state transition probability matrix, Pᵃss′ = P[St+1 = s′ | St = s, At = a]
R is a reward function, Rᵃs = E[Rt+1 | St = s, At = a]
γ is a discount factor, γ ∈ [0, 1]
54. Example: Student MDP
55. Value Function
Definition
The state-value function vπ(s) of an MDP is the expected return starting from state s, and then following policy π
vπ(s) = Eπ[Gt | St = s]
Definition
The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π
qπ(s, a) = Eπ[Gt | St = s, At = a]
56. Example: State-Value Function for Student MDP
57. Bellman Expectation Equation
The state-value function can again be decomposed into immediate reward plus discounted value of the successor state,
vπ(s) = Eπ[Rt+1 + γvπ(St+1) | St = s]
The action-value function can similarly be decomposed,
qπ(s, a) = Eπ[Rt+1 + γqπ(St+1, At+1) | St = s, At = a]
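These expectation equations are the basis of iterative policy evaluation. The sketch below evaluates a uniform random policy on a tiny 2-state, 2-action MDP; all transition probabilities and rewards are invented for illustration.

```python
# Hypothetical MDP: P[s][a] lists (next_state, probability) pairs; Rw[s][a]
# is the expected immediate reward E[R_{t+1} | S_t = s, A_t = a].
states, actions, gamma = (0, 1), (0, 1), 0.9
P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)], 1: [(1, 1.0)]}}
Rw = {0: {0: 0.0, 1: 1.0},
      1: {0: 0.0, 1: 2.0}}
pi = {s: {a: 0.5 for a in actions} for s in states}   # uniform random policy

# Repeatedly apply the Bellman expectation backup
#   v_pi(s) <- sum_a pi(a|s) * (R^a_s + gamma * sum_s' P^a_ss' * v_pi(s'))
# until it converges to the fixed point v_pi.
v = {s: 0.0 for s in states}
for _ in range(500):
    v = {s: sum(pi[s][a] * (Rw[s][a] + gamma * sum(p * v[s2] for s2, p in P[s][a]))
                for a in actions)
         for s in states}
```

Solving the two resulting linear equations by hand gives vπ(0) = 7.25 and vπ(1) = 7.75, which the iteration converges to.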
58. Example: Bellman Expectation Equation for Student MDP
59. Conclusion
Reinforcement Learning is one of the most interesting and useful parts of machine learning.
In RL, the agent explores the environment without any human intervention. It is one of the main learning paradigms used in Artificial Intelligence.
The main issue with RL algorithms is that some factors, such as delayed feedback, can slow down learning.