Reinforcement Learning
Lecture 1: Introduction
Hado van Hasselt
Senior Staff Research Scientist, DeepMind
Reinforcement learning
2021
Lecturers
Diana Borsa
Matteo Hessel
Background material
Reinforcement Learning: An Introduction, Sutton & Barto 2018
http://incompleteideas.net/book/the-book-2nd.html
Admin for UCL students
- Check Moodle for updates
- Use Moodle for questions
- Grading: assignments
About this course
What is reinforcement learning?
Artificial Intelligence
Motivation
- First, automation of repeated physical solutions
  - Industrial revolution (1750 - 1850) and Machine Age (1870 - 1940)
- Second, automation of repeated mental solutions
  - Digital revolution (1950 - now) and Information Age
- Next step: allow machines to find solutions themselves
  - Artificial Intelligence
  - We then only need to specify a problem and/or goal
  - This requires learning autonomously how to make decisions
Can machines think?
– Alan Turing, 1950
In the process of trying to imitate an adult human mind we are bound to think a good deal about
the process which has brought it to the state that it is in. We may notice three components,
a. The initial state of the mind, say at birth,
b. The education to which it has been subjected,
c. Other experience, not to be described as education, to which it has been subjected.
Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce
one which simulates the child’s? If this were then subjected to an appropriate course of education one
would obtain the adult brain. Presumably the child-brain is something like a note-book as one buys
it from the stationers. Rather little mechanism, and lots of blank sheets. (Mechanism and writing
are from our point of view almost synonymous.) Our hope is that there is so little mechanism in
the child-brain that something like it can be easily programmed.
– Alan Turing, 1950
What is artificial intelligence?
- We will use the following definition of intelligence:
  To be able to learn to make decisions to achieve goals
- Learning, decisions, and goals are all central
What is Reinforcement Learning?
What is reinforcement learning?
- People and animals learn by interacting with their environment
- This differs from certain other types of learning
  - It is active rather than passive
  - Interactions are often sequential — future interactions can depend on earlier ones
- We are goal-directed
- We can learn without examples of optimal behaviour
  - Instead, we optimise some reward signal
The interaction loop
Goal: optimise sum of rewards, through repeated interaction
The reward hypothesis
Reinforcement learning is based on the reward hypothesis:
Any goal can be formalized as the outcome of maximizing a cumulative reward
Examples of RL problems
- Fly a helicopter → Reward: air time, inverse distance, ...
- Manage an investment portfolio → Reward: gains, gains minus risk, ...
- Control a power station → Reward: efficiency, ...
- Make a robot walk → Reward: distance, speed, ...
- Play video or board games → Reward: win, maximise score, ...
If the goal is to learn via interaction, these are all reinforcement learning problems
(Irrespective of which solution you use)
What is reinforcement learning?
There are distinct reasons to learn:
1. Find solutions
  - A program that plays chess really well
  - A manufacturing robot with a specific purpose
2. Adapt online, deal with unforeseen circumstances
  - A chess program that can learn to adapt to you
  - A robot that can learn to navigate unknown terrains
- Reinforcement learning can provide algorithms for both cases
- Note that the second point is not (just) about generalization — it is about continuing to learn efficiently online, during operation
What is reinforcement learning?
- Science and framework of learning to make decisions from interaction
- This requires us to think about
  - ...time
  - ...(long-term) consequences of actions
  - ...actively gathering experience
  - ...predicting the future
  - ...dealing with uncertainty
- Huge potential scope
- A formalisation of the AI problem
Example: Atari
Formalising the RL Problem
Agent and Environment
- At each step t the agent:
  - Receives observation Ot (and reward Rt)
  - Executes action At
- The environment:
  - Receives action At
  - Emits observation Ot+1 (and reward Rt+1)
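The loop above can be sketched in a few lines of code. This is purely illustrative and not part of the lecture; the toy environment, class names, and the random agent are all assumptions.

    import random

    class GuessTheCoinEnv:
        """Toy environment: the agent tries to guess a hidden coin flip each step."""
        def reset(self):
            return 0                               # initial observation O0

        def step(self, action):
            coin = random.randint(0, 1)            # hidden environment state
            reward = 1.0 if action == coin else 0.0
            observation = coin                     # Ot+1: the last flip is revealed
            return observation, reward

    class RandomAgent:
        """Ignores observations and rewards; picks actions uniformly at random."""
        def act(self, observation):
            return random.randint(0, 1)

    env, agent = GuessTheCoinEnv(), RandomAgent()
    observation, total_reward = env.reset(), 0.0
    for t in range(100):
        action = agent.act(observation)            # the agent executes At
        observation, reward = env.step(action)     # the environment emits Ot+1, Rt+1
        total_reward += reward
    print("sum of rewards over 100 steps:", total_reward)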
Rewards
- A reward Rt is a scalar feedback signal
- Indicates how well the agent is doing at step t — defines the goal
- The agent's job is to maximize cumulative reward
  Gt = Rt+1 + Rt+2 + Rt+3 + ...
- We call this the return
Reinforcement learning is based on the reward hypothesis:
Any goal can be formalized as the outcome of maximizing a cumulative reward
Values
- We call the expected cumulative reward, from a state s, the value
  v(s) = E[Gt | St = s]
       = E[Rt+1 + Rt+2 + Rt+3 + ... | St = s]
- The value depends on the actions the agent takes
- Goal is to maximize value, by picking suitable actions
- Rewards and values define the utility of states and actions (no supervised feedback)
- Returns and values can be defined recursively
  Gt = Rt+1 + Gt+1
  v(s) = E[Rt+1 + v(St+1) | St = s]
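As a small illustrative sketch (not from the slides), the recursive identity Gt = Rt+1 + Gt+1 gives a simple way to compute all returns of an episode in one backward pass; the function name and example rewards are made up.

    def returns(rewards):
        """rewards[t] holds Rt+1, the reward received after step t."""
        g = 0.0
        out = [0.0] * len(rewards)
        for t in reversed(range(len(rewards))):
            g = rewards[t] + g        # Gt = Rt+1 + Gt+1
            out[t] = g
        return out

    print(returns([0.0, 0.0, 1.0, 2.0]))   # [3.0, 3.0, 3.0, 2.0]

Averaging such sampled returns from a state s then gives a simple estimate of v(s).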
Maximising value by taking actions
- Goal: select actions to maximise value
- Actions may have long-term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward
- Examples:
  - Refueling a helicopter (might prevent a crash in several hours)
  - Defensive moves in a game (may help chances of winning later)
  - Learning a new skill (can be costly & time-consuming at first)
- A mapping from states to actions is called a policy
Action values
- It is also possible to condition the value on actions:
  q(s, a) = E[Gt | St = s, At = a]
          = E[Rt+1 + Rt+2 + Rt+3 + ... | St = s, At = a]
- We will talk in depth about state and action values later
Core concepts
The reinforcement learning formalism includes
- Environment (dynamics of the problem)
- Reward signal (specifies the goal)
- Agent, containing:
  - Agent state
  - Policy
  - Value function estimate?
  - Model?
- We will now go into the agent
Inside the Agent: the Agent State
Agent components
- Agent state
- Policy
- Value functions
- Model
Environment State
- The environment state is the environment's internal state
- It is usually invisible to the agent
- Even if it is visible, it may contain lots of irrelevant information
Agent State
- The history is the full sequence of observations, actions, rewards
  Ht = O0, A0, R1, O1, ..., Ot−1, At−1, Rt, Ot
- For instance, the sensorimotor stream of a robot
- This history is used to construct the agent state St
Fully Observable Environments
Full observability
Suppose the agent sees the full environment state
- observation = environment state
- The agent state could just be this observation:
  St = Ot = environment state
Markov decision processes
Markov decision processes (MDPs) are a useful mathematical framework
Definition
A decision process is Markov if
p (r, s | St, At) = p (r, s | Ht, At)
- This means that the state contains all we need to know from the history
- It doesn't mean the state contains everything, just that adding more history doesn't help
- =⇒ Once the state is known, the history may be thrown away
- The full environment + agent state is Markov (but large)
- The full history Ht is Markov (but keeps growing)
- Typically, the agent state St is some compression of Ht
- Note: we use St to denote the agent state, not the environment state
Partially Observable Environments
- Partial observability: the observations are not Markovian
  - A robot with camera vision isn't told its absolute location
  - A poker-playing agent only observes public cards
- Now using the observation as state would not be Markovian
- This is called a partially observable Markov decision process (POMDP)
- The environment state can still be Markov, but the agent does not know it
- We might still be able to construct a Markov agent state
Agent State
- The agent's actions depend on its state
- The agent state is a function of the history
- For instance, St = Ot
- More generally:
  St+1 = u(St, At, Rt+1, Ot+1)
  where u is a 'state update function'
- The agent state is often much smaller than the environment state
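One concrete (assumed, purely illustrative) choice of state update function u is to keep a fixed-length window of recent observations; actions and rewards could be appended in the same way.

    def update_state(state, action, reward, observation, window=4):
        """St+1 = u(St, At, Rt+1, Ot+1): append the new observation, drop the oldest."""
        return (tuple(state) + (observation,))[-window:]

    state = ()                                   # S0, before anything is observed
    for observation in [1, 2, 3, 4, 5, 6]:       # a stream of observations
        state = update_state(state, action=None, reward=0.0, observation=observation)
    print(state)                                 # (3, 4, 5, 6)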
Agent State
The full environment state of a maze
Agent State
A potential observation
Agent State
An observation in a different location
Agent State
The two observations are indistinguishable
Agent State
These two states are not Markov
How could you construct a Markov agent state in this maze (for any reward signal)?
Partially Observable Environments
- To deal with partial observability, the agent can construct suitable state representations
- Examples of agent states:
  - Last observation: St = Ot (might not be enough)
  - Complete history: St = Ht (might be too large)
  - A generic update: St = u(St−1, At−1, Rt, Ot) (but how to pick/learn u?)
- Constructing a fully Markovian agent state is often not feasible
- More importantly, the state should allow good policies and value predictions
Inside the Agent: the Policy
Agent components
- Agent state
- Policy
- Value function
- Model
Policy
- A policy defines the agent's behaviour
- It is a map from agent state to action
- Deterministic policy: A = π(S)
- Stochastic policy: π(A|S) = p(A|S)
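A minimal sketch of the two kinds of policy (the action set and lookup table below are assumptions for illustration only):

    import random

    ACTIONS = ["N", "E", "S", "W"]

    def deterministic_policy(state):
        """A = π(S): a fixed mapping from agent state to action."""
        table = {"start": "E", "corridor": "N"}
        return table.get(state, "W")

    def stochastic_policy(state):
        """A ∼ π(·|S): here simply uniform over the four actions, whatever the state."""
        return random.choice(ACTIONS)

    print(deterministic_policy("start"), stochastic_policy("start"))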
Inside the Agent: Value Estimates
Agent components
- Agent state
- Policy
- Value function
- Model
Value Function
- The actual value function is the expected return
  vπ(s) = E[Gt | St = s, π]
        = E[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s, π]
- We introduced a discount factor γ ∈ [0, 1]
  - Trades off importance of immediate vs long-term rewards
- The value depends on a policy
- Can be used to evaluate the desirability of states
- Can be used to select between actions
Value Functions
- The return has a recursive form: Gt = Rt+1 + γGt+1
- Therefore, the value has one as well:
  vπ(s) = E[Rt+1 + γGt+1 | St = s, At ∼ π(s)]
        = E[Rt+1 + γvπ(St+1) | St = s, At ∼ π(s)]
  Here a ∼ π(s) means a is chosen by policy π in state s (even if π is deterministic)
- This is known as a Bellman equation (Bellman 1957)
- A similar equation holds for the optimal (= highest possible) value:
  v∗(s) = max_a E[Rt+1 + γv∗(St+1) | St = s, At = a]
  This does not depend on a policy
- We heavily exploit such equalities, and use them to create algorithms
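When the dynamics are known and the state space is small, the Bellman equation for vπ is just a linear system. The two-state example below is made up for illustration (it is not from the lecture); it solves vπ = rπ + γ Pπ vπ directly.

    import numpy as np

    gamma = 0.9
    r_pi = np.array([1.0, 0.0])        # expected immediate reward in each state under π
    P_pi = np.array([[0.5, 0.5],
                     [0.0, 1.0]])      # state-transition probabilities under π

    # Bellman equation in matrix form: v = r_pi + gamma * P_pi @ v
    v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
    print(v)                           # the value of each state under π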
Value Function approximations
- Agents often approximate value functions
- We will discuss algorithms to learn these efficiently
- With an accurate value function, we can behave optimally
- With suitable approximations, we can behave well, even in intractably big domains
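A minimal sketch of value function approximation, assuming hand-made features and sampled returns as regression targets (all names and numbers below are illustrative, not course code):

    import numpy as np

    def features(state):
        """A tiny hand-made feature vector, for illustration only."""
        return np.array([1.0, float(state), float(state) ** 2])

    w = np.zeros(3)                               # parameters of v̂(s, w) = w · features(s)
    step_size = 0.01
    samples = [(0, 1.0), (1, 2.5), (2, 5.0)]      # (state, sampled return Gt) pairs

    for _ in range(2000):
        for state, g in samples:
            x = features(state)
            w += step_size * (g - w @ x) * x      # gradient step on the squared error
    print([round(float(w @ features(s)), 2) for s, _ in samples])   # approximate values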
Inside the Agent: Models
Agent components
- Agent state
- Policy
- Value function
- Model
Model
- A model predicts what the environment will do next
- E.g., P predicts the next state:
  P(s, a, s′) ≈ p(St+1 = s′ | St = s, At = a)
- E.g., R predicts the next (immediate) reward:
  R(s, a) ≈ E[Rt+1 | St = s, At = a]
- A model does not immediately give us a good policy - we would still need to plan
- We could also consider stochastic (generative) models
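A sketch of one simple way to learn such a model from experience (the structure is an assumption for illustration): count transitions and average rewards per state-action pair.

    from collections import defaultdict

    transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
    reward_sums = defaultdict(float)                           # (s, a) -> summed reward
    visit_counts = defaultdict(int)                            # (s, a) -> visit count

    def update_model(s, a, r, s_next):
        transition_counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        visit_counts[(s, a)] += 1

    def predicted_next_state_probs(s, a):   # approximates p(St+1 = s' | St = s, At = a)
        n = visit_counts[(s, a)]
        return {s_next: count / n for s_next, count in transition_counts[(s, a)].items()}

    def predicted_reward(s, a):             # approximates E[Rt+1 | St = s, At = a]
        return reward_sums[(s, a)] / visit_counts[(s, a)]

    update_model("s0", "E", 1.0, "s1")
    update_model("s0", "E", 0.0, "s1")
    print(predicted_next_state_probs("s0", "E"), predicted_reward("s0", "E"))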
An Example
Maze Example
[Figure: a maze with Start and Goal marked]
- Rewards: −1 per time-step
- Actions: N, E, S, W
- States: agent's location
Maze Example: Policy
[Figure: the maze with Start and Goal marked]
- Arrows represent policy π(s) for each state s
Maze Example: Value Function
[Figure: the maze, with each reachable cell labelled by its value vπ(s), ranging from −24 (far from the goal) to −1 (adjacent to the goal); Start and Goal marked]
- Numbers represent value vπ(s) of each state s
Maze Example: Model
[Figure: a partial model of the maze; each modelled cell is labelled with the immediate reward −1; Start and Goal marked]
- Grid layout represents the partial transition model P^a_ss′
- Numbers represent the immediate reward R^a_ss′ from each state s (the same for all a and s′ in this case)
Agent Categories
- Value Based
  - No Policy (Implicit)
  - Value Function
- Policy Based
  - Policy
  - No Value Function
- Actor Critic
  - Policy
  - Value Function
Agent Categories
- Model Free
  - Policy and/or Value Function
  - No Model
- Model Based
  - Optionally Policy and/or Value Function
  - Model
Subproblems of the RL Problem
Prediction and Control
- Prediction: evaluate the future (for a given policy)
- Control: optimise the future (find the best policy)
- These can be strongly related:
  π∗(s) = argmax_π vπ(s)
- If we could predict everything, do we need anything else?
Learning and Planning
Two fundamental problems in reinforcement learning
- Learning:
  - The environment is initially unknown
  - The agent interacts with the environment
- Planning:
  - A model of the environment is given (or learnt)
  - The agent plans in this model (without external interaction)
  - a.k.a. reasoning, pondering, thought, search, planning
Learning Agent Components
- All components are functions
  - Policies: π : S → A (or to probabilities over A)
  - Value functions: v : S → R
  - Models: m : S → S and/or r : S → R
  - State update: u : S × O → S
- E.g., we can use neural networks, and use deep learning techniques to learn
- Take care: we often violate assumptions made in supervised learning (iid, stationarity)
- Deep learning is an important tool
- Deep reinforcement learning is a rich and active research field
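As a sketch of "all components are functions" (the types below are placeholders, not course code), each component can be written down with the signature listed above; a lookup table, a linear function, or a deep neural network can then play the same role, as long as it respects the signature.

    from typing import Callable

    State, Action, Observation = int, int, int     # illustrative placeholder types

    Policy = Callable[[State], Action]                    # π : S -> A
    ValueFunction = Callable[[State], float]              # v : S -> R
    TransitionModel = Callable[[State], State]            # m : S -> S
    RewardModel = Callable[[State], float]                # r : S -> R
    StateUpdate = Callable[[State, Observation], State]   # u : S × O -> S

    def last_observation_state(s: State, o: Observation) -> State:
        """The simplest state update: keep only the latest observation."""
        return o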
Examples
Atari Example: Reinforcement Learning
[Figure: the interaction loop for Atari; the agent receives observation ot and reward rt and selects action at]
- Rules of the game are unknown
- Learn directly from interactive game-play
- Pick actions on joystick, see pixels and scores
Gridworld Example: Prediction
(a) [Figure: 5×5 gridworld with special states A and B and their successor states A′ and B′, reached with rewards +10 and +5; actions move one cell N, E, S, or W]
(b) State values of the uniform random policy:
     3.3   8.8   4.4   5.3   1.5
     1.5   3.0   2.3   1.9   0.5
     0.1   0.7   0.7   0.4  -0.4
    -1.0  -0.4  -0.4  -0.6  -1.2
    -1.9  -1.3  -1.2  -1.4  -2.0
Reward is −1 when bumping into a wall, γ = 0.9
What is the value function for the uniform random policy?
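The values in panel (b) can be reproduced with iterative policy evaluation. The sketch below is illustrative, not course code, and assumes the usual dynamics of this example: from A every action moves to A′ with reward +10, from B to B′ with reward +5, moves off the grid leave the state unchanged with reward −1, and all other moves give reward 0.

    import numpy as np

    SIZE, GAMMA = 5, 0.9
    A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # N, S, W, E

    def step(state, move):
        if state == A:
            return A_PRIME, 10.0
        if state == B:
            return B_PRIME, 5.0
        row, col = state[0] + move[0], state[1] + move[1]
        if 0 <= row < SIZE and 0 <= col < SIZE:
            return (row, col), 0.0
        return state, -1.0                        # bumped into a wall: stay, reward -1

    v = np.zeros((SIZE, SIZE))
    for _ in range(1000):                         # sweep until (approximately) converged
        new_v = np.zeros_like(v)
        for row in range(SIZE):
            for col in range(SIZE):
                for move in MOVES:                # uniform random policy: probability 1/4
                    next_state, reward = step((row, col), move)
                    new_v[row, col] += 0.25 * (reward + GAMMA * v[next_state])
        v = new_v
    print(np.round(v, 1))                         # should match the values shown above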
Gridworld Example: Control
(a) [Figure: the same 5×5 gridworld]
(b) Optimal state values v∗:
    22.0  24.4  22.0  19.4  17.5
    19.8  22.0  19.8  17.8  16.0
    17.8  19.8  17.8  16.0  14.4
    16.0  17.8  16.0  14.4  13.0
    14.4  16.0  14.4  13.0  11.7
(c) [Figure: a corresponding optimal policy π∗]
What is the optimal value function over all possible policies?
What is the optimal policy?
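The optimal values follow from the Bellman optimality equation, for example by value iteration. The sketch below uses the same assumed dynamics as the prediction sketch (repeated so it runs on its own) and is again only illustrative; acting greedily with respect to v∗ then gives an optimal policy π∗.

    import numpy as np

    SIZE, GAMMA = 5, 0.9
    A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # N, S, W, E

    def step(state, move):
        if state == A:
            return A_PRIME, 10.0
        if state == B:
            return B_PRIME, 5.0
        row, col = state[0] + move[0], state[1] + move[1]
        if 0 <= row < SIZE and 0 <= col < SIZE:
            return (row, col), 0.0
        return state, -1.0

    v_star = np.zeros((SIZE, SIZE))
    for _ in range(1000):
        new_v = np.zeros_like(v_star)
        for row in range(SIZE):
            for col in range(SIZE):
                # Bellman optimality backup: v*(s) = max_a E[R + γ v*(S')]
                new_v[row, col] = max(reward + GAMMA * v_star[next_state]
                                      for next_state, reward in
                                      (step((row, col), m) for m in MOVES))
        v_star = new_v
    print(np.round(v_star, 1))                    # should match the optimal values above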
Course
- In this course, we discuss how to learn by interaction
- The focus is on understanding core principles and learning algorithms
Topics include
- Exploration, in bandits and in sequential problems
- Markov decision processes, and planning by dynamic programming
- Model-free prediction and control (e.g., Q-learning)
- Policy-gradient methods
- Deep reinforcement learning
- Integrating learning and planning
- ...
Example: Locomotion
End of Lecture