2. Overview
1 About Reinforcement Learning
2 The Reinforcement Learning Problem
Rewards
Environment
State
3 Inside an RL Agent
4 Problems within Reinforcement Learning & MDPs
3. About Reinforcement Learning
Many Facets of Reinforcement Learning
Figure: Many Facets of Reinforcement Learning.
4. Branches of Machine Learning
Figure: Branches of Machine Learning.
5. What is Reinforcement Learning?
Reinforcement learning is the science of sequential decision making: it deals with agents that must sense and act upon their environment.
An agent learns to behave in an environment by performing actions and observing the results of previous actions.
For each good action the agent receives positive feedback, and for each bad action it receives negative feedback.
In RL, the agent learns automatically from this feedback, without any labelled data, unlike supervised learning.
6. Characteristics of Reinforcement Learning
What makes reinforcement learning different from other machine learning paradigms?
No supervisor, only a reward signal
Feedback is delayed, not instantaneous
Time really matters (sequential, non-i.i.d. data)
Agent's actions affect the subsequent data it receives
7. Reinforcement Learning Components
Rewards
A reward Rt is a scalar feedback signal
It indicates how well the agent is doing at time-step t
The agent's job is to maximise cumulative reward
Reinforcement learning is based on the reward hypothesis
Definition (Reward Hypothesis)
All goals can be described by the maximisation of expected cumulative reward
8. Examples of Rewards
Fly stunt manoeuvres in a helicopter
+ve reward for following the desired trajectory
−ve reward for crashing
Defeat the world champion
+ve/−ve reward for winning/losing a game
9. Examples of Rewards (continued)
Control a power station
+ve reward for producing power
−ve reward for exceeding safety thresholds
Make a humanoid robot walk
+ve reward for forward motion
−ve reward for falling over
10. Agent and Environment
At each step t the agent:
Executes action At
Receives observation Ot
Receives scalar reward Rt
The environment:
Receives action At
Emits observation Ot
Emits scalar reward Rt
t increments at each environment step
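The interaction loop above can be sketched in a few lines of Python. `RandomWalkEnv` is a hypothetical toy environment invented here for illustration (not part of any library): the agent executes At, and the environment emits Ot and Rt.

```python
import random

class RandomWalkEnv:
    """Hypothetical toy environment: states 0..4, start in the middle.
    Reaching state 4 emits reward +1; states 0 and 4 are terminal."""
    def __init__(self):
        self.state = 2

    def step(self, action):
        # The environment receives A_t, then emits O_t and R_t
        self.state += action                     # action is -1 (left) or +1 (right)
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state in (0, 4)
        return self.state, reward, done          # observation, reward, terminal flag

env = RandomWalkEnv()
total_reward, done = 0.0, False
while not done:
    action = random.choice([-1, +1])             # the agent executes A_t
    obs, reward, done = env.step(action)         # and receives O_t and R_t
    total_reward += reward
```

The same step-based loop shape (act, observe, receive reward, repeat until terminal) underlies most RL software interfaces.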
11. History and State
The history is the sequence of actions, observations, and rewards:
Ht = A1, O1, R1, . . . , At, Ot, Rt
What happens next depends on the history:
The agent selects actions
The environment selects observations/rewards
State is the concise information used to determine what happens next
Formally, state is a function of the history:
St = f(Ht)
12. Environment State
The environment state Sᵉt is the environment's internal data representation
i.e. whatever data the environment uses to pick the next observation/reward
The environment can be real-world or simulated
13. Agent State
The agent state Sᵃt is the agent's internal representation
i.e. whatever information the agent uses to pick the next action
i.e. it is the information used by reinforcement learning algorithms
It can be a function of history:
Sᵃt = f(Ht)
14. Information State
An information state (a.k.a. Markov state) contains all useful information from the history.
Definition
A state St is Markov if and only if
P[St+1 | St] = P[St+1 | S1, . . . , St]
"The future is independent of the past given the present"
Once the state is known, the history may be thrown away
i.e. the state is a sufficient statistic of the future
The environment state Sᵉt is Markov
15. Fully Observable Environments
Full observability: the agent directly observes the environment state:
Ot = Sᵃt = Sᵉt
Agent state = environment state = information state
Formally, this is a Markov decision process (MDP)
16. Partially Observable Environments
Partial observability: the agent indirectly observes the environment:
A robot with camera vision isn't told its absolute location
Now agent state ≠ environment state; formally this is a partially observable Markov decision process (POMDP)
17. Major Components of the RL Agent
An RL agent may include one or more of these components:
Policy: the agent's behaviour (which action to take in each state)
Value function: prediction of future reward
Model: the agent's view of the environment
18. Policy
A policy is the agent's behaviour
It is a map from state to action
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[At = a | St = s]
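The two kinds of policy can be sketched as plain functions; the state threshold and the action names ("east"/"west") are made-up illustrations, not part of any standard API.

```python
import random

# Deterministic policy: a = pi(s), a direct map from state to action
def pi_det(s):
    return "east" if s < 3 else "west"

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s],
# here represented by sampling an action from the conditional distribution
def pi_stoch(s):
    probs = {"east": 0.8, "west": 0.2} if s < 3 else {"east": 0.2, "west": 0.8}
    actions = list(probs)
    return random.choices(actions, weights=[probs[a] for a in actions])[0]
```

A deterministic policy always returns the same action for a given state; a stochastic policy defines a probability over actions and is what exploration strategies typically build on.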
19. Value Function
A value function is a prediction of future reward
Used to evaluate the goodness/badness of states and/or actions
And therefore to select between actions
vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + . . . | St = s]
20. Model
A model is the agent's view of the environment
P predicts the next state (transition model):
Pᵃss′ = P[St+1 = s′ | St = s, At = a]
R predicts the next (immediate) reward (reward model):
Rᵃs = E[Rt+1 | St = s, At = a]
21. Maze Example
Rewards: −1 per time-step
Actions: N, E, S, W
States: agent's location
22. Maze Example: Policy
Arrows represent policy π(s) for each state s
23. Maze Example: Value Function
Numbers represent value vπ(s) of each state s (the number of steps the agent is away from the goal)
24. Maze Example: Model
Grid layout represents the transition model Pᵃss′
Numbers represent the immediate reward Rᵃs for each state s (same for all a)
25. Categorizing RL agents (1)
Value Based
No Policy (implicit)
Value Function
Policy Based
Policy
No Value Function
Actor Critic
Policy
Value Function
26. Categorizing RL agents (2)
Model Free
Policy and/or Value Function
No Model
Model Based
Policy and/or Value Function
Model
27. Reinforcement Learning Agent Taxonomy
Figure: Reinforcement Learning Agent Taxonomy.
28. Learning and Planning
Two fundamental problems in sequential decision making:
Reinforcement Learning:
The environment model is initially unknown
The agent interacts with the environment
The agent improves its policy
Planning:
A model of the environment is known
The agent performs computations with its model (without any interaction)
The agent improves its policy
a.k.a. deliberation, reasoning, introspection, pondering, thought, search
29. Exploration and Exploitation
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy
From its experiences of the environment
Without losing too much reward along the way
Exploration finds more information about the environment
Exploitation exploits known information to maximise reward
It is usually important to explore as well as exploit
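One common way to balance the two is ε-greedy action selection, sketched below; the function name and the action-value list are illustrative assumptions, not part of the lecture material.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (uniform random action);
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

q = [0.2, 0.9, 0.1]                  # hypothetical action-value estimates
a = epsilon_greedy(q, epsilon=0.0)   # epsilon = 0 means pure exploitation
```

Setting ε = 0 gives pure exploitation (always the greedy action); ε = 1 gives pure exploration (uniformly random actions); values in between trade one off against the other.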
30. Exploration and Exploitation Examples
Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
31. Prediction and Control
Prediction: evaluate the future
Given a policy
Control: optimise the future
Find the best policy
34. Introduction to MDP
(Outline: Markov Process, Markov Reward Process, Markov Decision Process)
A Markov decision process is a mathematical framework for describing the environment in reinforcement learning
Where the environment is fully observable
Almost all RL problems can be formalised as MDPs
Partially observable problems can be converted into MDPs
35. Markov Property
"The future is independent of the past given the present"
Definition
A state St is Markov if and only if
P[St+1 | St] = P[St+1 | S1, . . . , St]
The state captures all relevant information from the history
Once the state is known, the history may be thrown away
i.e. the state is a sufficient statistic of the future
36. State Transition Matrix
For a Markov state s and successor state s′, the state transition probability is defined by
Pss′ = P[St+1 = s′ | St = s]
The state transition matrix P defines the transition probabilities from all states s to all successor states s′,

        ⎡ P11 · · · P1n ⎤
    P = ⎢  ·         ·  ⎥
        ⎣ Pn1 · · · Pnn ⎦

where each row of the matrix sums to 1.
37. Markov Process
A Markov process is a memoryless random process, i.e. a sequence of states S1, S2, . . . with the Markov property.
Definition
A Markov Process (or Markov Chain) is a tuple ⟨S, P⟩
S is a (finite) set of states
P is a state transition probability matrix, Pss′ = P[St+1 = s′ | St = s]
38. Example: Student Markov Chain/Process
39. Example: Student Markov Process Episodes
Sample episodes for the Student Markov Chain starting from S1 = C1:
S1, S2, . . . , ST
C1 C2 C3 Pass Sleep
C1 FB FB C1 C2 Sleep
C1 C2 C3 Pub C2 C3 Pass Sleep
C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
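Episodes like these can be sampled in a few lines of Python. The transition probabilities below are transcribed from the usual Student Markov chain figure, so treat the exact numbers as assumptions; Sleep is the terminal state.

```python
import random

# Transition probabilities (each row sums to 1); Sleep is terminal.
P = {
    "C1":   {"C2": 0.5, "FB": 0.5},
    "C2":   {"C3": 0.8, "Sleep": 0.2},
    "C3":   {"Pass": 0.6, "Pub": 0.4},
    "Pass": {"Sleep": 1.0},
    "Pub":  {"C1": 0.2, "C2": 0.4, "C3": 0.4},
    "FB":   {"FB": 0.9, "C1": 0.1},
}

def sample_episode(start="C1"):
    """Sample S1, S2, ..., ST by following P until the terminal state."""
    state, episode = start, [start]
    while state != "Sleep":
        successors = list(P[state])
        state = random.choices(successors,
                               weights=[P[state][s] for s in successors])[0]
        episode.append(state)
    return episode
```

Each call returns one episode; running it repeatedly produces traces like the samples listed above, with their relative frequencies governed by P.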
41. Markov Reward Process
A Markov reward process is a Markov chain with rewards.
Definition
A Markov Reward Process is a tuple ⟨S, P, R, γ⟩
S is a (finite) set of states
P is a state transition probability matrix, Pss′ = P[St+1 = s′ | St = s]
R is a reward function, Rs = E[Rt+1 | St = s]
γ is a discount factor, γ ∈ [0, 1]
43. Returns
Definition
The return Gt is the total discounted reward from time-step t:
Gt = Rt+1 + γRt+2 + γ²Rt+3 + . . . = Σ (k=0 to ∞) γᵏ Rt+k+1
The discount γ ∈ [0, 1] is the present value of future rewards
The value of receiving reward R after k + 1 time-steps is γᵏR
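The definition translates directly into code. Given a reward sequence Rt+1, Rt+2, . . . (here just a hypothetical list), the return Gt is:

```python
def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma**k * R_{t+k+1},
    where rewards[k] holds R_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A reward of 1 received after k + 1 = 3 steps is worth gamma**2 * 1
g = discounted_return([0.0, 0.0, 1.0], gamma=0.5)   # 0.5**2 = 0.25
```

With gamma = 1 the return is the plain sum of rewards; as gamma shrinks toward 0, rewards further in the future count for less.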
44. Why discount?
Most Markov reward and decision processes are discounted. Why?
Mathematically convenient to use
Avoids infinite returns in cyclic Markov processes
If the reward is financial, immediate rewards may earn more interest than delayed rewards
Human behaviour shows preference for immediate reward
45. Value Function
The value function v(s) gives the long-term value of state s
Definition
The state value function v(s) of an MRP is the expected return starting from state s
v(s) = E[Gt | St = s]
47. Example: State-Value Function for Student MRP (1)
48. Example: State-Value Function for Student MRP (2)
49. Example: State-Value Function for Student MRP (3)
50. Bellman Equation for MRPs
The value function can be decomposed into two parts:
immediate reward Rt+1
discounted value of successor state γv(St+1)
v(s) = E[Gt | St = s]
     = E[Rt+1 + γRt+2 + γ²Rt+3 + . . . | St = s]
     = E[Rt+1 + γ(Rt+2 + γRt+3 + . . .) | St = s]
     = E[Rt+1 + γv(St+1) | St = s]
51. Example: Bellman Equation for Student MRP
52. Bellman Equation in Matrix Form
The Bellman equation can be expressed concisely using matrices,
v = R + γPv
where v is a column vector with one entry per state:

⎡ v(1) ⎤   ⎡ R1 ⎤     ⎡ P11 · · · P1n ⎤ ⎡ v(1) ⎤
⎢  ·   ⎥ = ⎢  · ⎥ + γ ⎢  ·         ·  ⎥ ⎢  ·   ⎥
⎣ v(n) ⎦   ⎣ Rn ⎦     ⎣ Pn1 · · · Pnn ⎦ ⎣ v(n) ⎦
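Because v appears on both sides, v = R + γPv can be solved directly as v = (I − γP)⁻¹R, or by fixed-point iteration as sketched below on a small 2-state MRP whose numbers are invented purely for illustration:

```python
R = [1.0, 2.0]            # reward vector, one entry per state (assumed values)
P = [[0.5, 0.5],          # transition matrix; each row sums to 1
     [0.0, 1.0]]          # state 1 is absorbing
gamma = 0.5

# Iterate v <- R + gamma * P v; the map is a contraction for gamma < 1,
# so it converges to the unique fixed point of the Bellman equation.
v = [0.0, 0.0]
for _ in range(200):
    v = [R[s] + gamma * sum(P[s][t] * v[t] for t in range(len(R)))
         for s in range(len(R))]
```

For this toy MRP the fixed point can be checked by hand: v(2) = 2 + 0.5·v(2) gives v(2) = 4, and substituting back gives v(1) = 8/3.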
53. Markov Decision Process
A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
Definition
A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩
S is a (finite) set of states
A is a finite set of actions
P is a state transition probability matrix, Pᵃss′ = P[St+1 = s′ | St = s, At = a]
R is a reward function, Rᵃs = E[Rt+1 | St = s, At = a]
γ is a discount factor, γ ∈ [0, 1]
54. Example: Student MDP
55. Value Function
Definition
The state-value function vπ(s) of an MDP is the expected return starting from state s, and then following policy π
vπ(s) = Eπ[Gt | St = s]
Definition
The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π
qπ(s, a) = Eπ[Gt | St = s, At = a]
56. Example: State-Value Function for Student MDP
57. Bellman Expectation Equation
The state-value function can again be decomposed into immediate reward plus discounted value of the successor state,
vπ(s) = Eπ[Rt+1 + γvπ(St+1) | St = s]
The action-value function can similarly be decomposed,
qπ(s, a) = Eπ[Rt+1 + γqπ(St+1, At+1) | St = s, At = a]
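These expectation equations are the basis of iterative policy evaluation. The sketch below evaluates a uniform random policy on a tiny 2-state, 2-action MDP; all transition probabilities and rewards are invented for illustration.

```python
# Hypothetical MDP: P[s][a] lists (next_state, probability) pairs; Rw[s][a]
# is the expected immediate reward E[R_{t+1} | S_t = s, A_t = a].
states, actions, gamma = (0, 1), (0, 1), 0.9
P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)], 1: [(1, 1.0)]}}
Rw = {0: {0: 0.0, 1: 1.0},
      1: {0: 0.0, 1: 2.0}}
pi = {s: {a: 0.5 for a in actions} for s in states}   # uniform random policy

# Repeatedly apply the Bellman expectation backup
#   v_pi(s) <- sum_a pi(a|s) * (R^a_s + gamma * sum_s' P^a_ss' * v_pi(s'))
# until it converges to the fixed point v_pi.
v = {s: 0.0 for s in states}
for _ in range(500):
    v = {s: sum(pi[s][a] * (Rw[s][a] + gamma * sum(p * v[s2] for s2, p in P[s][a]))
                for a in actions)
         for s in states}
```

Solving the two resulting linear equations by hand gives vπ(0) = 7.25 and vπ(1) = 7.75, which the iteration converges to.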
58. Example: Bellman Expectation Equation for Student MDP
59. Conclusion
Reinforcement Learning is one of the most interesting and useful parts of machine learning.
In RL, the agent explores the environment without any human intervention. It is one of the main learning paradigms used in Artificial Intelligence.
The main issue with RL algorithms is that some factors, such as delayed feedback, can slow down learning.