Machine Learning
Dilbert from https://dilbert.com/strip/2013-02-02
Introduction to
Reinforcement Learning
Lee Chuk Munn
Slides at http://bit.ly/ld2019_rl
Objective
Introduce major reinforcement learning concepts
What is Machine Learning?
“Machine learning is the field of
study that gives computers the
ability to learn without being
explicitly programmed”
Arthur Samuel
Programmed vs Learnt
Supervised Learning
Generalization
- Give sample data and the answers (labels) for the data
- Infer rules from the samples and their answers
Goal - find known answers
Labels: apple, apple, carrot, broccoli; Data: the corresponding images
It's an apple
Unsupervised Learning
Comparing
- Give sample data but not the answers
- Use some measure of similarity to group the data
Goal - find unknown patterns
Data
Reinforcement Learning
Feedback
- Do not give sample data or answers
- Infer rules from positive or negative feedback
Goal - find the optimal action
Environment
Wrote the first program that learned how to play checkers in 1959
Uses the minimax algorithm
Arthur Samuel
How to teach a machine to play Tic Tac Toe?
Images from http://users.sussex.ac.uk/~christ/crs/kr-ist/lec05a.html
Minimax game tree levels: MAX, MIN, MAX
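The minimax idea behind Samuel's checkers program and the Tic Tac Toe game tree above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not code from the talk; the game interface (legal_moves, apply, is_terminal, score) is an assumed placeholder.

# Minimal minimax sketch for a two-player, zero-sum game such as Tic Tac Toe.
# The `game` object (legal_moves, apply, is_terminal, score) is a hypothetical
# interface assumed for illustration.
def minimax(game, state, maximizing):
    if game.is_terminal(state):
        return game.score(state)              # e.g. +1 win, -1 loss, 0 draw
    values = [minimax(game, game.apply(state, move), not maximizing)
              for move in game.legal_moves(state)]
    # MAX levels pick the largest value, MIN levels the smallest
    return max(values) if maximizing else min(values)

def best_move(game, state):
    # Choose the move whose resulting position is best for the MAX player
    return max(game.legal_moves(state),
               key=lambda m: minimax(game, game.apply(state, m), False))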
Model for Reinforcement Learning
Agent and Environment: the agent observes the state (S) and reward (R), and responds with an action (A)
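A minimal sketch of this agent-environment loop; the env and agent objects are hypothetical placeholders for illustration, not a specific library API.

# Agent-environment interaction loop - illustrative sketch only.
def run_episode(env, agent):
    state = env.reset()                                 # S: initial observation
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state)                       # A: agent picks an action
        next_state, reward, done = env.step(action)     # R: environment responds
        agent.learn(state, action, reward, next_state)  # update from the feedback
        state = next_state
        total_reward += reward
    return total_reward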
Markov Decision Process
How Good is a Position/State?
Central Idea of Any Autonomous Agent
Markov Decision Process
Bellman Expectation Equation / Dynamic Programming
Annotations on the equation: the probability of selecting an action, and the result of selecting an action
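The equation on the slide is the Bellman expectation equation for the state value under a policy π; written out in its standard form (not copied from the slide image):

v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right]

Here \pi(a \mid s) is the probability of selecting an action, and the inner sum is the result of selecting it: the expected immediate reward plus the discounted value of the next state.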
Grid World
The robot selects any action (NORTH, EAST, SOUTH, WEST) with equal probability
The episode ends when the robot lands on either the red or the green square
The robot gets a -1 reward if it lands on a red square
The robot gets a 5 reward if it lands on a green square
The robot gets a 0.1 reward if it lands on a black square
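A small sketch of how state values like those on the next slides can be computed by iterative policy evaluation under this equiprobable-random policy. The tiny transition table below is an illustrative stand-in, not the actual grid from the slides.

# Iterative policy evaluation sketch for an equiprobable-random policy.
# The transition table is a made-up two-state example, NOT the slides' grid:
# each entry maps (state, action) -> (reward, next_state).
GAMMA = 1.0
ACTIONS = ["N", "E", "S", "W"]
TERMINAL = "GOAL"

transitions = {("A", a): (0.1, "B") for a in ACTIONS}
transitions.update({("B", a): (5.0, TERMINAL) for a in ACTIONS})

def sweep(v):
    """One Bellman expectation backup: v(s) = sum_a pi(a|s) * (r + gamma * v(s'))."""
    new_v = {}
    for s in {state for (state, _) in transitions}:
        total = 0.0
        for a in ACTIONS:
            r, s_next = transitions[(s, a)]
            total += (1.0 / len(ACTIONS)) * (r + GAMMA * v.get(s_next, 0.0))
        new_v[s] = total
    return new_v

v = {"A": 0.0, "B": 0.0}
for t in range(2):            # two sweeps, like the two iterations on the slides
    v = sweep(v)
    print(f"t = {t}:", v)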
First Iteration - State Value
Squares A, B, C, D
t = 0: state values 0.78, -0.45, -0.45, -0.45
Second Iteration - State Value
t = 0: state values 0.78, -0.45, -0.45, -0.45
t = 1: state values 0.66, -0.4, -0.68, -0.68
State Value
We know how good it is to be in a particular state, i.e. how close we are to the goal
But how do we know what is the best action to take so that we can land in the next best state?
t = 0: state values 0.78, -0.45, -0.45, -0.45
t = 1: state values 0.66, -0.4, -0.68, -0.68
Action Value
Pick the best action
t = 1: state values 0.66, -0.4, -0.68, -0.68
First Iteration - Action Values
Per-action values on the grid: 1.25, 0.03, -0.25, -0.25, 0.03, 0.03, -0.25, -0.25, 0.03, 0.03, -0.25, -0.25, 0.03, 0.03, -0.25, -0.25
Second Iteration - Action Values
Per-action values on the grid: 1.25, -0.09, -0.25, -0.25, 0.22, -0.25, -0.25, -0.09, -0.09, -0.25, -0.25, -0.09, -0.09, -0.25, -0.25, -0.09
A policy is emerging after 2 iterations
Action Value
There is no way to make a decision with the state value alone
Keep the value of each action separate so we can select the best action
Treat this as an array of terms; max behaves like the Python max() function
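A minimal sketch of "keep a value per action and pick the best one"; the action values below are illustrative numbers, not taken from a specific square on the slides.

# Greedy selection from action values kept in a plain dict (illustrative only).
q_values = {"N": -0.25, "E": 1.25, "S": -0.25, "W": 0.03}

best_action = max(q_values, key=q_values.get)   # pick the best action
state_value = max(q_values.values())            # v(s) = max over a of q(s, a)
print(best_action, state_value)                 # E 1.25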
When to Use What
Deterministic - as sure as the sunrise: Minimax
Stochastic - as sure as the weather forecast: MDP
What happens when you end up in a totally alien environment?
Law of Large Numbers - Monte Carlo
When you have no knowledge of the MDP (model free)
Sample forever and the probabilities will converge to their true values
What should you wear?
Monte Carlo
Generate an episode using your current agent's policy
Average the return following the first occurrence of each (state, action) pair
(D,E) = (0.1 + -1 + 0.1 + 0.1 + -1) / 2 = -0.85
(C,N) = -1 / 1 = -1
(C,E) = (0.1 + -1) / 1 = -0.9
(B,S) = -1 / 1 = -1
Use the slightly better policy to generate another episode
Episode 1: (D,E) reward 0.1, (C,N) reward -1
Episode 2: (D,E) reward 0.1, (C,E) reward 0.1, (B,S) reward -1
Squares A, B, C, D
After 2 Rollouts
Estimated action values: (D,E) = -0.85, (C,N) = -1, (C,E) = -0.9, (B,S) = -1
Squares A, B, C, D
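A sketch of first-visit Monte Carlo evaluation as described above: average, over episodes, the return following the first occurrence of each (state, action) pair. The data structures and names are illustrative.

# First-visit Monte Carlo action-value estimation - illustrative sketch.
# An episode is a list of (state, action, reward) tuples.
from collections import defaultdict

returns_sum = defaultdict(float)
returns_count = defaultdict(int)
q = {}

def update_from_episode(episode, gamma=1.0):
    """Average the return following the FIRST occurrence of each (state, action)."""
    seen = set()
    for t, (state, action, _) in enumerate(episode):
        if (state, action) in seen:
            continue                      # first-visit: ignore later occurrences
        seen.add((state, action))
        # Return = (discounted) sum of rewards from step t to the end of the episode
        g = sum(gamma ** k * r for k, (_, _, r) in enumerate(episode[t:]))
        returns_sum[(state, action)] += g
        returns_count[(state, action)] += 1
        q[(state, action)] = returns_sum[(state, action)] / returns_count[(state, action)]

# The two rollouts from the slide
update_from_episode([("D", "E", 0.1), ("C", "N", -1)])
update_from_episode([("D", "E", 0.1), ("C", "E", 0.1), ("B", "S", -1)])
print(q)   # approximately {('D','E'): -0.85, ('C','N'): -1.0, ('C','E'): -0.9, ('B','S'): -1.0}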
Monte Carlo Updates
Rollouts from state S_t: the return from here is the real reward collected to the end of the episode
Monte Carlo
Going to Alpha Centauri
Temporal Difference Updates
Rollouts from state S_t: take 1 real step, then predict the rest of the return from the next state
Temporal Difference
SARSA
Different Techniques
Unified View of Reinforcement Learning
Demo
State/Action Value
Exact
- Dynamic Programming
Approximate
- Linear combination
- Neural net
- Etc.
Monte Carlo vs Temporal Difference Updates
Monte Carlo: rollouts to the end of the episode; the return from here is the real reward
Temporal Difference (SARSA): 1 real step, then predict the rest of the return
Incremental Mean
Does this look familiar?
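The incremental mean shown on the slide is the standard identity (written out here, not copied from the slide image):

\mu_k = \mu_{k-1} + \frac{1}{k} \left( x_k - \mu_{k-1} \right)

The new mean is the old mean nudged toward the latest sample - the same shape as the updates that follow.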
Monte Carlo Update
Episodic updates based on real returns
Annotated terms: learning rate; state value of the previous iteration; slightly better state value after the update; cumulative reward from the current step to the end of the episode
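In standard notation the Monte Carlo update is (a standard form matching the annotations above, not copied from the slide image):

V(S_t) \leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right)

where \alpha is the learning rate and G_t is the cumulative reward from step t to the end of the episode.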
Temporal Difference TD(0)
Incremental updates based on bootstraps
One-step reward plus the predicted value of the next state
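Written out, the TD(0) update replaces the full return with a one-step bootstrap (standard form, not copied from the slide image):

V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)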
Difference Between MC and TD
Monte Carlo
- Update policy at the end of an episode
- Works only in episodic environments
- High variance
Temporal Difference
- Update policy a little after every step
- Works with incomplete / non-terminating sequences
- High bias
SARSA(0)
Incremental updates based on bootstraps
Uses action values instead of state values
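The corresponding SARSA(0) update, in its standard form (not copied from the slide image), bootstraps from the action value of the next (state, action) pair:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right)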
