2. Table of Contents
• What is reinforcement learning (RL)?
• Purpose and Specificity of This Course
• How we should learn RL and how this course is structured
• Introducing RL with an example of balancing a bike
• Wrapping up
4. Role of Reinforcement Learning (RL) in AI
[Diagram: AI contains machine learning; machine learning is split into models (classical models, neural networks) and how to train models (supervised learning, unsupervised learning, reinforcement learning)]
5. Review: Supervised Learning
[Figure: a model approximating a real-world black box]
• Approximation of a real-world black box
• Supervision by the disparity between predictions and labels (loss function)
[Figure: supervision given by whether the model's predictions are correct or incorrect]
6. Deep Learning: Model as Neural Networks
• NLP: vectors → vectors (e.g. labels such as positive/negative)
• Image processing: a tensor → vectors/tensors (e.g. class labels such as cat, dog, horse, goat…)
8. Review: Various Training Examples for Deep Learning
• Classification: vectors → vectors, supervised by a correct label (e.g. positive/negative)
• Regression (e.g. translation): vectors → vectors, supervised by a correct translation
• ChatGPT: vectors → vectors, supervised by giving rankings to outputs (No. 1, No. 2, No. 3)
9. Review: Supervised or Unsupervised Training of ML Models
• Supervised/unsupervised learning framework
[Figure: data and supervising data flow into an ML model, which is optimized]
• How the ML model is optimized: optimization with gradient descent (in the direction where a loss function decreases)
[Figure: an initialized function gradually approaching the optimal function]
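To make the picture concrete, here is a minimal gradient-descent sketch (my own illustration, not from the slides; the linear model and all numbers are invented for the example):

```python
import numpy as np

# A minimal gradient-descent sketch: fit y = w * x by repeatedly
# stepping in the direction where the mean squared loss decreases.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)  # the "real-world black box" (w* = 3)

w = 0.0        # the initialized function
lr = 0.1       # learning rate
for _ in range(100):
    pred = w * x
    grad = 2.0 * np.mean((pred - y) * x)  # gradient of the loss w.r.t. w
    w -= lr * grad                        # move where the loss decreases

print(w)  # close to the optimal function's parameter, 3.0
```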
10. Basic Ideas of ML Types
• Supervised learning
(approximating functions)
• Unsupervised learning
(finding structures in data with heuristic rules)
• Reinforcement learning (finding the best action in each state)
[Figure: bike-balancing example with the actions "no move", "lean left", "lean right" in each state]
11. Differences among the Three Major Training Methods
Supervised learning
• Data: inputs and labels
• Objective: metrics such as accuracy
• Supervision: differences between predictions and labels
• Directness: direct supervision
• Timing: immediate supervision
Unsupervised learning
• Data: only inputs
• Objective: some insights for humans
• Supervision: hand-crafted loss
• Directness: indirect supervision
• Timing: immediate supervision
Reinforcement learning
• Data: an environment
• Objective: expected return
• Supervision: differences between expectations and actual rewards
• Directness: indirect supervision
• Timing: delayed supervision after some steps
[Figure: the bike example with the actions "no move", "lean left", "lean right"]
12. Table of Contents
• What is reinforcement learning (RL)?
• Purpose and Specificity of This Course
• How we should learn RL and how this course is structured
• Introducing RL with an example of balancing a bike
• Wrapping up
13. The Purpose and Specificity of This Course
Reaching deep RL as efficiently as possible at an implementation level
• We organized the contents to emphasize that RL algorithms come from one core idea (GPI: generalized policy iteration)
• We cover a minimum of content deeply, to prioritize having a big overview of RL and reaching deep RL
• And we always tell you where you are now, and what the limits are of the scope covered in each lecture and in the whole course
14. Textbook and Side Reader
• The most famous, popular
• Notations in this lecture follow this book
• Available for free
• A lot of practical examples
• Not necessarily recommended to read
everything in the order of this book
15. *Topics Not Covered by This Course
• Precise mathematical derivation
• Eligibility traces
• Details of RL with function approximation
• Partially observable Markov decision process
16. Table of Contents
• What is reinforcement learning (RL)?
• Purpose and Specificity of This Course
• How we should learn RL and how this course is structured
• Introducing RL with an example of balancing a bike
• Wrapping up
17. Tips for Making it Easier to Study RL in the Beginning
• In RL, ultimately you just want to optimize a policy, a probability of taking an action based only on where you are.
• In the beginning, you study dynamic programming (DP), a kind of RL without trial and error, and later you in a sense approximate DP with trial and error.
• Unlike typical supervised or unsupervised learning, two functions, a value and a policy, are interactively optimized in DP and RL.
• In the beginning, you work on disappointingly simple environments.
• Many of the algorithms introduced in most textbooks come from one fundamental idea.
18. First of All: A Slightly More Detailed Definition of RL
• Sequential decision making: optimizing a sequence of actions
• Policy: a function mapping a state to an action
• Markov decision process: the next state and reward depend only on the current state and action
• Expected return: an expectation of a sum of rewards over time steps
Reinforcement learning solves a sequential decision-making problem by optimizing a policy, mainly in a Markov decision process, such that the expected return is maximized.
Let's see what this means, keeping some important points in mind for now
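For reference, the expected return in the definition above is usually written as follows (standard notation consistent with the Barto and Sutton textbook this course follows; the discount factor gamma does not appear on the slide and is my addition):

```latex
% Return G_t: the (discounted) sum of rewards over time steps,
% with a discount factor 0 <= gamma <= 1.
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
% RL optimizes the policy pi so that the expected return is maximized:
\max_{\pi} \; \mathbb{E}_{\pi}\left[ G_t \right]
```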
19. Markov Decision Process and Environments
• Markov decision process: the next state and reward depend only on the current state and action
How you reached the state does not matter
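Stated more formally (a standard way to write the Markov property; this equation is my addition, not on the slide):

```latex
% Markov property: the next state and reward depend only on the
% current state and action, not on the earlier history.
\Pr(S_{t+1} = s', R_{t+1} = r \mid S_t, A_t)
  = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_0, A_0, S_1, A_1, \ldots, S_t, A_t)
```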
20. Sequential Decision Making by Optimizing a Policy
[Figure: bike-balancing states with arrows for the actions "no move", "lean left", "lean right"]
• Policy: a function mapping a state to an action (arrows in the figures)
• Sequential decision making: optimizing a sequence of actions
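As a minimal sketch of such a policy in code (my own illustration; the state names and probabilities are invented):

```python
import random

# A stochastic policy: for each state, a probability for each action.
policy = {
    "leaning left":  {"no move": 0.1, "lean left": 0.1, "lean right": 0.8},
    "upright":       {"no move": 0.8, "lean left": 0.1, "lean right": 0.1},
    "leaning right": {"no move": 0.1, "lean left": 0.8, "lean right": 0.1},
}

def act(state: str) -> str:
    """Sample an action based only on the current state."""
    probs = policy[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(act("leaning right"))  # most likely "lean left"
```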
21. Expected Return
• Expected return: an expectation of a sum of rewards over time steps
[Figure: two grid worlds. Left: policies are optimized so that an agent reaches the goal as soon as possible. Right: policies are optimized so that an agent doesn't approach penalty areas. Legend: positive-reward, negative-reward, and unmovable cells]
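As a small sketch of how the reward design above shapes the optimized policy (my own illustration; the cell names and numbers are invented):

```python
# Return of an episode: the sum of rewards over its time steps.
# With a step penalty, faster paths to the goal score higher returns,
# so the optimized policy reaches the goal as soon as possible.
def reward(cell: str, step_penalty: float = -0.1) -> float:
    table = {"goal": +1.0, "penalty": -1.0, "empty": 0.0}
    return table[cell] + step_penalty

def episode_return(cells: list[str]) -> float:
    return sum(reward(c) for c in cells)

print(episode_return(["empty", "empty", "goal"]))             # 0.7
print(episode_return(["empty", "penalty", "empty", "goal"]))  # -0.4
```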
22. Tips for Making it Easier to Study RL in the Beginning
• In RL, ultimately you just want to optimize a policy, a probability of taking an action based only on where you are.
• In the beginning, you study dynamic programming (DP), a kind of RL without trial and error, and later you in a sense approximate DP with trial and error.
• Unlike typical supervised or unsupervised learning, two functions, a value and a policy, are interactively optimized in DP and RL.
• In the beginning, you work on disappointingly simple environments.
• Many of the algorithms introduced in most textbooks come from one fundamental idea.
23. The Main Purpose of Lectures 2 and 3
Planning, dynamic programming ("RL" without trial and error)
[Figure: on one side, an agent planning by itself; on the other, the agent-environment loop with action and reward]
RL (planning with trial and error)
The processes of planning are approximated with "experiences" of the agent
25. Tips for Making it Easier to Study RL in the Beginning
• In RL, ultimately you just want to optimize a policy, a probability of taking an action based only on where you are.
• In the beginning, you study dynamic programming (DP), a kind of RL without trial and error, and later you in a sense approximate DP with trial and error.
• Unlike typical supervised or unsupervised learning, two functions, a value and a policy, are interactively optimized in DP and RL.
• In the beginning, you work on disappointingly simple environments.
• Many of the algorithms introduced in most textbooks come from one fundamental idea.
26. The Core Idea through This Lecture
[Figure: the agent-environment loop (action, reward); inside the agent, a value and a policy (the part that should be emphasized more)]
27. Optimization in RL Training: Interactive Updates of Value and Policy
• Supervised/unsupervised learning
[Figure: data and supervising data flow into an ML model, optimized in the direction where a loss function decreases]
• Reinforcement learning
[Figure: the agent-environment loop (action, reward); a policy (ML model) and a value (ML model) are updated by policy improvement and policy evaluation, in the direction where the expected reward increases, but along a zig-zag path]
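To make the zig-zag concrete, here is a runnable miniature of these interactive updates on a 5-state chain (my own sketch; the reward values and discount factor are illustrative assumptions, not course code):

```python
# A miniature of the interactive updates on the 5-state bike chain:
# evaluate the value under the current policy, then improve the policy
# greedily against that value, and alternate.
STATES = range(5)
ACTIONS = (-1, 0, +1)                    # lean left / no move / lean right
REWARD = {0: -1.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: -1.0}  # falling is penalized
GAMMA = 0.9

def step(s, a):
    return min(4, max(0, s + a))

V = {s: 0.0 for s in STATES}
policy = {s: 0 for s in STATES}          # start with "no move" everywhere

for _ in range(20):
    # Policy evaluation: move V toward what the current policy achieves
    for _ in range(20):
        V = {s: REWARD[step(s, policy[s])] + GAMMA * V[step(s, policy[s])]
             for s in STATES}
    # Policy improvement: act greedily with respect to the current V
    policy = {s: max(ACTIONS, key=lambda a, s=s: REWARD[step(s, a)] + GAMMA * V[step(s, a)])
              for s in STATES}

print(policy)  # edge states lean back toward the middle; the middle stays
```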
28. Value or Policy
[Figure, left ("policy + value"): in the agent-environment loop (action, reward), the agent holds a policy (ML model) and a state value (ML model). Right ("value"): the agent holds only an action value (ML model). The optimal value and policy can be derived from the action value function.]
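For instance, once an action value function Q is learned, the policy can be read off by acting greedily (a standard construction; the table entries below are invented for illustration):

```python
# Deriving a policy from an action value function Q(s, a):
# in each state, take the action with the highest value.
Q = {
    ("leaning left", "lean right"): 0.9,   # invented example values
    ("leaning left", "no move"):    0.2,
    ("leaning left", "lean left"):  -0.5,
}

def greedy_action(Q, state):
    candidates = {a: q for (s, a), q in Q.items() if s == state}
    return max(candidates, key=candidates.get)

print(greedy_action(Q, "leaning left"))  # "lean right"
```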
29. Tips for Making it Easier to Study RL in the Beginning
• In RL, ultimately you just want to optimize a policy, a probability of taking an action based only on where you are.
• In the beginning, you study dynamic programming (DP), a kind of RL without trial and error, and later you in a sense approximate DP with trial and error.
• Unlike typical supervised or unsupervised learning, two functions, a value and a policy, are interactively optimized in DP and RL.
• In the beginning, you work on disappointingly simple environments.
• Many of the algorithms introduced in most textbooks come from one fundamental idea.
30. Expressivity of Environments
RL with tabular data → RL with classical function approximation → deep RL (from low to high expressivity)
[Figure: three agent-environment loops (action, reward, next state); in each, the agent holds a value and a policy, represented with increasing expressivity, starting from tabular data]
31. Tips for Making it Easier to Study RL in the Beginning
• In RL, ultimately you just want to optimize a policy, a probability of taking an action based only on where you are.
• In the beginning, you study dynamic programming (DP), a kind of RL without trial and error, and later you in a sense approximate DP with trial and error.
• Unlike typical supervised or unsupervised learning, two functions, a value and a policy, are interactively optimized in DP and RL.
• In the beginning, you work on disappointingly simple environments.
• Many of the algorithms introduced in most textbooks come from one fundamental idea.
32. Generalized Policy Iteration (GPI)
[Diagram: a map of algorithms as special cases of GPI, organized along two axes, model-based vs. model-free and value-based vs. policy + value: policy iteration (Lecture 3), SARSA and Q-learning (Lecture 4), Dyna-Q (Lecture 7), the tabular actor-critic method with advantage function (Lecture 8), and AlphaZero]
34. Course Schedule
1. What is Reinforcement Learning (RL) and how should we learn it?
2. Dynamic programming (DP), expression of a Markov decision process
3. (Implementation exercise) DP with policy iteration, value iteration
4. TD learning: introducing "experience" and "trial and error"
5. TD or Monte Carlo, Exploration or Exploitation
6. (Implementation exercise) Model-free RL with the OpenAI Gym format
7. Model-based RL and searching
8. Understanding RL so far as combinations of strategy mode settings
9. RL with function approximation: Approximation of Value Function
10. RL with function approximation: Approximation of Policy
11. (Practical implementation) Stock market environment from yfinance
12. (Practical implementation) Deep Q Networks with Video Game
13. (Buffer)
14. (Buffer)
[Figure: the course progression. First, "RL" without trial and error (an agent holding a value and a policy); then "experiences" and "trial and error" are introduced (the full agent-environment loop with action and reward); then RL training is elaborated along the axes value vs. policy, model-free vs. model-based, exploitation vs. exploration; finally, advanced topics and implementations make environments and agents richer (from low to high expressivity)]
35. Table of Contents
• What is reinforcement learning (RL)?
• Purpose and Specificity of This Course
• How we should learn RL and how this course is structured
• Introducing RL with an example of balancing a bike
• Wrapping up
36. MDP with an Example of Balancing a Bike
[Figure: the bike-balancing MDP drawn as five states (state 0 through state 4) with three actions: leaning left, no move, leaning right]
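Written as data, such an MDP could look like the sketch below (my own illustration; the slide does not specify the transitions, so the deterministic rule here is an assumption):

```python
# The bike-balancing MDP: 5 states (0 = fallen left, 2 = upright,
# 4 = fallen right) and 3 actions, with deterministic transitions.
ACTIONS = {"leaning left": -1, "no move": 0, "leaning right": +1}

def step(state: int, action: str) -> int:
    """The next state depends only on the current state and action."""
    return min(4, max(0, state + ACTIONS[action]))

print(step(2, "leaning right"))  # 3
```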
37. Values and Policies: with an Example of Balancing a Bike
• Value: how good it is to be in a state
• Policy: a probability of taking an action in a state
[Figure: state 0: minus reward; state 1: low value; state 2: high value; state 3: low value; state 4: minus reward. Under the policy, action 0 gets low probability and action 2 gets high probability]
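In code, the figure might be summarized as below (the numbers, and the state the probabilities attach to, are my assumptions chosen only to match the figure qualitatively):

```python
# Value: how good it is to be in each state (illustrative numbers).
V = {0: -1.0, 1: 0.2, 2: 0.9, 3: 0.2, 4: -1.0}

# Policy: probabilities of taking each action in one particular state;
# the action toward the high-valued state 2 gets high probability.
policy_in_state_1 = {"leaning left": 0.05, "no move": 0.15, "leaning right": 0.8}
```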
38. Policy Updates
• Higher probability on actions toward higher-valued states
[Figure: state 0 (minus reward), state 1 (low value), state 2 (high value); from state 1, action 1 (leaning right), which leads toward the high-value state, is given higher probability than action 0 (leaning left)]
Then how can a value be learned?
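(Before turning to that question, here is a sketch of the policy update just described; my own illustration, reusing the value table and transitions from the sketches above:)

```python
# Policy update: give higher probability to the action whose successor
# state has the highest value (epsilon-greedy form for illustration).
ACTIONS = {"leaning left": -1, "no move": 0, "leaning right": +1}
V = {0: -1.0, 1: 0.2, 2: 0.9, 3: 0.2, 4: -1.0}

def step(state, action):
    return min(4, max(0, state + ACTIONS[action]))

def improved_policy(state, eps=0.1):
    best = max(ACTIONS, key=lambda a: V[step(state, a)])
    n = len(ACTIONS)
    return {a: 1 - eps + eps / n if a == best else eps / n for a in ACTIONS}

print(improved_policy(1))  # highest probability on "leaning right"
```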
39. Value Update: How to Learn from "Experiences"
Updating values by filling a gap between expectation and actual rewards
[Figure, two thought bubbles:
• "If I lean left, the value is low. As expected!" (TD loss is low)
• "Leaning right would not be good because the value is low... I was wrong, there is no bad reward. Let's update the value." (TD loss is high)]
Learning can happen without explicit rewards
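A minimal sketch of this value update in the standard TD(0) form (the step size alpha and discount gamma follow Barto and Sutton's notation; the numbers are invented):

```python
# TD(0) update: close the gap (TD error) between the value we expected
# in a state and what actually happened one step later.
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    td_error = r + gamma * V[s_next] - V[s]  # expectation vs. reality
    V[s] += alpha * td_error                 # move the value toward reality
    return td_error

V = {1: 0.2, 2: 0.9}
err = td_update(V, s=1, r=0.0, s_next=2)  # no bad reward, high next value
print(err, V[1])  # positive TD error: state 1's value is revised upward
```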
40. Interactive Updates of Value and Policy
• Value updates: closing gaps between expectations and real rewards
• Policy updates: taking actions toward higher values
[Figure: the five bike states (low-, medium-, high-, medium-, low-valued) with the two kinds of updates alternating]
41. Table of Contents
• What is reinforcement learning (RL)?
• Purpose and Specificity of This Course
• How we should learn RL and how this course is structured
• Introducing RL with an example of balancing a bike
• Wrapping up
43. Practice Problems
Q1: Explain the terms below without using any mathematical notation (you don't need to follow the mathematical definitions strictly)
Q2: Explain what reinforcement learning is, using the terms above
A. Policy
B. Markov decision process
C. Expected reward
D. Value
45. Some Representative Definitions of RL (2)
• Wikipedia: “Reinforcement learning (RL) is an interdisciplinary area of
machine learning and optimal control concerned with how an intelligent agent
ought to take actions in a dynamic environment in order to maximize the
cumulative reward.”
• Barto and Sutton’s book: “Reinforcement learning is learning what to do—
how to map situations to actions—so as to maximize a numerical reward
signal.”
47. Markov Decision Process (MDP) in Some Expressions
[Figure: the same agent-environment interaction (agent, env, action, reward) drawn in four ways]
• Typical RL diagram
• State transition diagram
• Backup diagram (closed)
• Graphical model
Editor's Notes
Reinforcement learning is already used in many ways
For example….
First of all, reinforcement learning is a family of methods to train machine learning models, so reinforcement learning is not really about how to design the structure of a model, like neural networks
Reinforcement learning stands on par with supervised learning and unsupervised learning; these methods are related to each other and cannot be clearly separated as they get more complicated
We are going to keep using the term “machine learning” instead of “AI” throughout this course
First let’s review machine learning mainly with an example of supervised learning.
Machine learning, in short, wants to numerically approximate the real-world black box from data
In supervised learning, the parameters in a model are updated basically with the disparity between its predictions and correct labels
With the advent of deep learning, expressivity of models advanced rapidly and now they can process highly complicated data like texts and images.
In the figure, a vector is a word, or a label. And a sequence of vectors is a sentence.
In the figure, a matrix means a 1 channel image. And a tensor means a multi channel (RGB) tensor.
But still, it is important to keep in mind that neural networks are just learning mappings between vectors and vectors, or tensors and tensors
Unsupervised learning, on the other hand, does not need correct labels
Its main intention is finding structure in data with heuristic, hand-crafted rules
Its evaluation often depends on whether humans can gain any insights or not
There are various ways of training neural networks, but still they are more or less inside the frameworks of supervised or unsupervised learning
Note: we might need to be careful about how to explain generative models
Whether you have supervised data or not, the main idea of machine learning is learning a function/mapping by adjusting parameters so that a certain loss function gets smaller.
In supervised learning the loss function mainly depends on the structure of the correct labels, and in unsupervised learning humans mainly need to design the formulas.
And whichever function it is, an initial function gets closer to the optimal function so that the loss function gets smaller, and this is usually done by gradient descent
We have seen that supervised learning mainly approximates certain rules with labels, and unsupervised learning finds structures in data with heuristic rules through training
Then what is reinforcement learning, in short?
I would say reinforcement learning is a training method to find the optimal action in a given state
And reinforcement learning is very unique compared to the other two training methods in some ways.
First of all, instead of datasets, in RL you need an environment, which is something like a video game
And as an objective function, RL uses the expected reward in the long run
And supervision in RL is a bit tricky; the supervision also comes indirectly, sometimes after several time steps
These are very distinctive points of RL, and we will learn them little by little in this lecture.
This course is going to be special in these points
We basically use this very famous textbook by Barto and Sutton
But I personally think that it is not the most efficient to read this book in the order of its table of contents
Due to these intentions, we will not cover these topics in RL in this course
Let me introduce
Please don’t panic, but let me introduce a definition of RL with a few terms
You are going to see what these words mean little by little through this course, but now please remember that you just optimize a policy
Also it is important to note that RL considers a simplified environment where your next state and reward depend only on where you are.
Even for complicated environments like video games, the environment is assumed to be an MDP
But by learning how to move in each state of such MDP processes, you can make long-term plans of actions
And such policies and resulting planning are optimized so that expected return is maximized.
Expected return is an expectation of rewards over several time steps.
And the design of how rewards are given of course affects the policy to be optimized
For example, if you get a penalty (minus reward) on every time step, an agent learns to reach the goal as soon as possible; otherwise, the agent learns to take safer paths to avoid the red blocks
An important point to note is that in the beginning of most RL curricula, you don’t use trial and error
Instead you learn dynamic programming (DP), which is RL without trial and error in a sense.
In DP, an agent perfectly knows how an environment works, or in other words the agent knows the model of the environment
But in RL, the agent does not know how the environment works
And to approximate the effects of DP without the model, you introduce trial and error to approximate the processes introduced in DP
But in fact, whether to have a model or not is not binary, and there is a gradation between the two states.
You can have a perfect model of an environment like in DP
And you can have no model and just memorize actions in each state.
But as an intermediate solution, you can estimate the model and do planning during trial and error
Let me call such axes, for example the one between model-based and model-free, “strategy mode settings”
One of the most important points in this talk is that we should always focus on how two functions, a value and a policy, are optimized in RL
And trial and error is actually important, but it can be seen just as one way of sampling data to give supervision for training the value and the policy
I visualized an abstract idea of how a model is trained in supervised or unsupervised learning: the model approaches the optimal function with supervision from a loss function
In RL, on the other hand, two functions interactively reach the optimal functions along a zig-zag path
Remember that we just want to optimize the policy in the end, and the value function is indirectly giving supervision to the policy
This idea leads to another strategy mode setting, between value-based and policy-based
As I said, we will learn that RL algorithms basically come from the idea of these interactive updates of a policy and a value (GPI)
But as a special case of GPI, you can update only a value function (action value function)
Or in other words, policy is updated as a part of an action value function
What makes most RL curricula confusing is that DP is first introduced to show the policy-based idea, but after that, in practice, mainly value-based methods are introduced for a while
That is simply because we need to introduce a lot of advanced ideas to totally separate a policy from the value
And in fact, until the 8th lecture we learn RL through disappointingly easy environments such as grid maps or state transition diagrams
But rather than introducing deep learning ideas from the very beginning, this is going to be more efficient after studying supervised learning and unsupervised learning
Just as in supervised or unsupervised learning, as long as you know the frameworks of training, which models to use is not a big problem.
So please be patient; we have prepared enough implementations to play around with
As you saw, I introduced some strategy mode settings
And also, by using the axes of value vs. policy or low vs. high expressivity as strategy settings, various practical algorithms can be classified like this
Based on the tips for learning RL, our lectures are structured like this.
When you simply balance a bike with 5 states and only 3 actions, the MDP can be visualized like this.
As I said earlier, you optimize two functions, value and policy
A value function gives the value of a state, that is, how good it is to be in that state
And policy is an action making rule only based on where you are
The policy is updated in the direction of higher values.
That means the policy is supervised by values, not always by an explicit reward
And the value function is updated based on “experiences”
You make a certain estimation of
The learning comes later, and this is said to be close to phenomena in neuroscience
I’m not a neuroscience expert, but I think you have experienced something similar in practicing sports, instruments, or something like that
When you have not mastered an action, during practice you get an intuition that you will fail or that you are getting closer to success. And after that you get the result.
After repeating this, after a break or a sleep, you somehow master the action.
RL is much more simplified, but this indirect supervision is a key of RL
RL iterates these processes.
Values are indirectly learned from ”experiences” and the values give supervision to the policy
As today’s assignment, please wrap up today’s contents by yourself.