Introduction to Machine Learning
Lecture 21
Reinforcement Learning
Albert Orriols i Puig
http://www.albertorriols.net
aorriols@salle.url.edu
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
Recap of Lectures 5-18
Supervised learning
Data classification
Labeled data
Build a model that covers all the space
Unsupervised learning
Clustering
Unlabeled data
Group similar objects
Association rule analysis
Unlabeled data
Get the most frequent/important associations
Genetic Fuzzy Systems
Slide 2
Today’s Agenda
Introduction
Reinforcement Learning
Some examples before going further
Slide 3
Introduction
What does reinforcement learning aim at?
Learning from interaction (with environment)
Goal-directed learning
[Figure: the agent observes the state of the environment and acts on it in pursuit of a goal]
Learning what to do and its effect
Trial-and-error search and delayed reward
Slide 4
Introduction
Learn reactive behaviors
Behaviors as a mapping between perceptions and actions
The agent has to exploit what it already knows in order to
obtain reward, but it also has to explore in order to make
better action selections in the future.
Dilemma: neither exploitation nor exploration can be
pursued exclusively without failing at the task.
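One common way to balance exploitation and exploration is an ε-greedy rule. The sketch below is an illustration, not a method taken from these slides: the function name `epsilon_greedy`, the list `q_values` of estimated action values, and the parameter `epsilon` are all assumptions introduced here.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability epsilon the agent explores (uniform random action);
    otherwise it exploits the action with the highest current estimate.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest estimate (the first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical value estimates for three actions.
estimates = [0.1, 0.5, 0.2]
chosen = epsilon_greedy(estimates, epsilon=0.1)
```

Setting `epsilon` to 0 gives pure exploitation and setting it to 1 gives pure exploration; the dilemma above suggests that neither extreme is likely to work well.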
Slide 5
How Can We Learn It?
1. Look-up tables
   Perception   Action
   State 1      Action 1
   State 2      Action 2
   State 3      Action 3
   …            …
2. Neural networks
3. Rules
4. Finite automata
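As a minimal sketch of the first option, a look-up table can be stored as a dictionary from perceptions to actions. The state and action names and the helper `act` are placeholders chosen for illustration.

```python
# Hypothetical perceptions and actions, for illustration only.
policy_table = {
    "state_1": "action_1",
    "state_2": "action_2",
    "state_3": "action_3",
}

def act(perception, table=policy_table, default=None):
    """Return the action stored for a perception, or `default` if unseen."""
    return table.get(perception, default)
```

A table like this only works when the perceptions can be enumerated; the other representations trade that exactness for generalization to unseen perceptions.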
Slide 6
Reinforcement Learning
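Reward function: r: S → ℝ or r: S × A → ℝ
[Figure: agent-environment loop. The environment sends state s_t and reward r_t to the agent; the agent sends action a_t to the environment]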
Agent and environment interact at discrete time steps t=0,1,2, …
The agent
observes the state at step t: s_t ∈ S
produces an action at step t: a_t ∈ A(s_t)
gets the resulting reward: r_{t+1} ∈ ℝ
moves to the next state: s_{t+1}
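The interaction steps above can be sketched as a simple loop. The toy environment `CoinEnv`, its methods `actions` and `step`, and the function `run` are hypothetical names made up for this illustration; they do not come from the slides or from any particular library.

```python
import random

class CoinEnv:
    """Toy environment (hypothetical): the state is a bit, and the agent
    earns reward 1 when its action equals the current state."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = self.rng.randint(0, 1)

    def actions(self, state):
        return [0, 1]  # A(s_t): actions available in this state

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0  # r_{t+1}
        self.state = self.rng.randint(0, 1)            # s_{t+1}
        return self.state, reward

def run(env, policy, steps):
    """Run the agent-environment loop for a fixed number of time steps."""
    trace = []
    s = env.state
    for _ in range(steps):
        a = policy(s, env.actions(s))  # agent produces a_t
        s_next, r = env.step(a)        # environment returns s_{t+1}, r_{t+1}
        trace.append((s, a, r))
        s = s_next
    return trace

# A policy that copies the observed state earns reward 1 at every step.
trace = run(CoinEnv(), lambda s, available: s, steps=5)
```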
Slide 8
Reinforcement Learning
[Figure: agent-environment loop, as on the previous slide]
Trace of a trial:
… s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, …
Agent goal:
Maximize the total amount of reward it receives
That means maximizing not only the immediate reward,
but the cumulative reward in the long run
Slide 9
Example of RL
Example: Recycling robot
State
charge level of battery
Actions
look for cans, wait for can, go recharge
Reward
positive for finding cans, negative for running out of battery
Slide 10
More precisely…
Restricting to Markov Decision Processes (MDPs)
Finite set of situations
Finite set of actions
Transition probabilities
Reward probabilities
This means that
The agent needs to have complete information of the world
State s_{t+1} depends only on state s_t and action a_t
Slide 11
Recycling Robot Example
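Transition diagram (each edge is labeled with probability, reward)
State   Action     Next state   Probability   Reward
high    search     high         α             R^search
high    search     low          1 − α         R^search
low     search     low          β             R^search
low     search     high         1 − β         −3
high    wait       high         1             R^wait
low     wait       low          1             R^wait
low     recharge   high         1             0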
Slide 12
Recycling Robot Example
S = {high, low}
A(high) = {wait, search}
A(low) = {wait, search, recharge}
R^search: expected number of cans collected while searching
R^wait: expected number of cans collected while waiting
R^search > R^wait
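The recycling robot can be written down as a small data structure. In the sketch below, the transition structure follows the standard formulation of this example; treat it as an assumption wherever the slide's diagram is hard to read. The numeric values of `ALPHA`, `BETA`, `R_SEARCH` and `R_WAIT` are invented for illustration, since the slides leave them symbolic, and the helper names `step` and `expected_reward` are hypothetical.

```python
import random

ALPHA, BETA = 0.9, 0.4        # assumed values; symbolic in the slides
R_SEARCH, R_WAIT = 2.0, 1.0   # assumed values satisfying R_SEARCH > R_WAIT

# T[state][action] -> list of (probability, next_state, reward)
T = {
    "high": {
        "search": [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
        "wait": [(1.0, "high", R_WAIT)],
    },
    "low": {
        "search": [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],
        "wait": [(1.0, "low", R_WAIT)],
        "recharge": [(1.0, "high", 0.0)],
    },
}

def step(state, action, rng=random):
    """Sample (next_state, reward) for one transition."""
    outcomes = T[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

def expected_reward(state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(p * r for p, _, r in T[state][action])
```

With these assumed numbers, searching on a low battery has a negative expected immediate reward because of the −3 penalty for running out of charge, which is the kind of trade-off the agent has to learn.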
Slide 13
Breaking the Markovian Property
Possible problems that do not satisfy the MDP assumptions
When the sets of actions and states are not finite
Solution: discretize the sets of actions and states
When transition probabilities do not depend only on the current
state
Possible solution: represent states as structures built up
over time from sequences of sensations
This is a POMDP (partially observable MDP)
Use POMDP algorithms to solve these problems
Slide 14
Elements of Reinforcement Learning
Slide 15
Elements of RL
Policy: what to do
Reward: what’s good
Value: what’s good because it predicts reward
Model: What follows what
Slide 16
Components of an RL Agent
Policy (behavior)
Mapping from states to actions
π*: S → A
Reward
Local reward at time step t: r_t
Model
Probability of transition from state s to s’ by executing action a:
T(s,a,s’)
The transition probabilities depend only on these parameters
The model is not known by the agent
Slide 17
Components of an RL Agent
Value functions
V^π(s): long-term reward estimate from state s when following
policy π
Q^π(s,a): long-term reward estimate from state s when executing
action a and then following policy π
A simple example
A maze
Note that the agent does not know its own position. It can only
perceive what it has in the surrounding states
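One common way to write the two value functions formally is shown below. It assumes the discounted sum of rewards with discount factor γ that appears later in this lecture, and it writes E_π for the expectation when the agent follows policy π; this notation is an assumption added here, not taken from the slides.

```latex
V^{\pi}(s) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s \right]
\qquad
Q^{\pi}(s,a) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \right]
```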
Slide 18
Pursuing the goal: Maximize long term reward
Slide 20
Goals and Rewards
OK, but I need to maximize my long-term reward. How do I
get the long-term reward?
The long-term reward is defined in terms of the goal of the agent
The agent receives the local reward at each time step
How?
Intuitive idea: sum all the rewards obtained so far
Problem: the sum can grow without bound in non-ending tasks
Slide 21
Goals and Rewards
How can we deal with non-ending tasks?
Weighted addition of local rewards
The γ parameter (0 < γ < 1) is the discounting factor
Trace: … s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, …
Note the bias toward immediate rewards
If you want to avoid it, set γ close to 1
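The weighted addition is usually written as r_{t+1} + γ r_{t+2} + γ² r_{t+3} + …, so a reward that arrives k steps later is scaled by γ^k. The sketch below computes this sum for a finite list of rewards; the function name `discounted_return` and the sample reward lists are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Return rewards[0] + gamma*rewards[1] + gamma**2*rewards[2] + ...

    `rewards` is a finite list of local rewards in the order received.
    """
    total = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# A reward of 10 that arrives two steps late, under two discount factors.
patient = discounted_return([0, 0, 10], gamma=0.9)
impatient = discounted_return([0, 0, 10], gamma=0.1)
```

Here the delayed reward of 10 is worth about 8.1 with γ = 0.9 but only about 0.1 with γ = 0.1, which illustrates why a γ close to 1 weakens the bias toward immediate rewards.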
Slide 22
Some examples
Slide 23
Pole balancing
Balance the pole
The car can move forward and backward
Avoid failure:
the pole falling beyond
a certain critical angle
the car hitting the end of the track
Reward
−1 upon failure
Resulting discounted return: −γ^k, for k steps before failure
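Assuming every step before failure gives reward 0 and that rewards are discounted by a factor γ (both assumptions made here), a reward of −1 received after k steps contributes −γ^k to the return measured from the start. The hypothetical helper below shows why balancing for longer is better under this scheme.

```python
def pole_return(steps_before_failure, gamma):
    """Discounted return when the only nonzero reward is -1 at failure."""
    return -(gamma ** steps_before_failure)

# With gamma = 0.9, failing later gives a return closer to zero.
early = pole_return(5, gamma=0.9)
late = pole_return(50, gamma=0.9)
```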
Slide 24
Mountain Car Problem
Objective
Get to the top of the hill as quickly as possible
State definition:
Car position and speed
Actions
Forward, reverse, none
Reward
−1 for each step in which the car is not at the top of the hill
Return: −(number of steps before reaching the top of the hill)
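With a reward of −1 per step and no discounting (an assumption made here), the total reward of an episode is minus its length, so maximizing reward means reaching the top in as few steps as possible. The helper name and episode lengths below are hypothetical.

```python
def mountain_car_return(steps_to_goal):
    """Undiscounted return of an episode with reward -1 per step."""
    rewards = [-1] * steps_to_goal  # one -1 for every step not at the top
    return sum(rewards)

# A faster episode earns a higher (less negative) return.
slow = mountain_car_return(200)
fast = mountain_car_return(120)
```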
Slide 25
Next Class
How to learn the policies
Slide 26