2. Table of Contents
• What is reinforcement learning (RL)?
• Purpose and Specificity of This Course
• How we should learn RL and how this course is structured
• Introducing RL with an example of balancing a bike
• Wrapping up
4. Role of Reinforcement Learning (RL) in AI
[Diagram: AI contains machine learning; machine learning is split into models (classical models, neural networks) and how to train models (supervised learning, unsupervised learning, reinforcement learning)]
5. Review: Supervised Learning
[Figure: a model approximating a real-world black box]
• Approximation of a real-world black box
• Supervision by the disparity between predictions and labels (loss function)
[Figure: supervision given by whether the model's predictions are correct or incorrect]
6. Deep Learning: Model as Neural Networks
• NLP: vectors → vectors (e.g. labels such as positive/negative)
• Image processing: a tensor → vectors/tensors (e.g. class labels such as cat, dog, horse, goat…)
8. Review: Various Training Examples for Deep Learning
• Classification: vectors → vectors, supervised by a correct label (e.g. positive/negative)
• Regression (e.g. translation): vectors → vectors, supervised by a correct translation
• ChatGPT: vectors → vectors, supervised by giving rankings to outputs (No. 1, No. 2, No. 3)
9. Review: Supervised or Unsupervised Training of ML Models
• Supervised/unsupervised learning framework
[Figure: data and supervising data flow into an ML model, which is optimized]
• How the ML model is optimized: optimization with gradient descent (in the direction where a loss function decreases)
[Figure: an initialized function gradually approaching the optimal function]
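To make the picture concrete, here is a minimal gradient-descent sketch (my own illustration, not from the slides; the linear model and all numbers are invented for the example):

```python
import numpy as np

# A minimal gradient-descent sketch: fit y = w * x by repeatedly
# stepping in the direction where the mean squared loss decreases.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)  # the "real-world black box" (w* = 3)

w = 0.0        # the initialized function
lr = 0.1       # learning rate
for _ in range(100):
    pred = w * x
    grad = 2.0 * np.mean((pred - y) * x)  # gradient of the loss w.r.t. w
    w -= lr * grad                        # move where the loss decreases

print(w)  # close to the optimal function's parameter, 3.0
```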
10. Basic Ideas of ML Types
• Supervised learning
(approximating functions)
• Unsupervised learning
(finding structures in data with heuristic rules)
• Reinforcement learning (finding the best action in each state)
[Figure: bike-balancing example with the actions "no move", "lean left", "lean right" in each state]
11. Differences among the Three Major Training Methods
Supervised learning
• Data: inputs and labels
• Objective: metrics such as accuracy
• Supervision: differences between predictions and labels
• Directness: direct supervision
• Timing: immediate supervision
Unsupervised learning
• Data: only inputs
• Objective: some insights for humans
• Supervision: hand-crafted loss
• Directness: indirect supervision
• Timing: immediate supervision
Reinforcement learning
• Data: an environment
• Objective: expected return
• Supervision: differences between expectations and actual rewards
• Directness: indirect supervision
• Timing: delayed supervision after some steps
[Figure: the bike example with the actions "no move", "lean left", "lean right"]
12. Table of Contents
• What is reinforcement learning (RL)?
• Purpose and Specificity of This Course
• How we should learn RL and how this course is structured
• Introducing RL with an example of balancing a bike
• Wrapping up
13. The Purpose and Specificity of This Course
Reaching deep RL as efficiently as possible at an implementation level
• We organized the contents to emphasize that RL algorithms come from one core idea (GPI: generalized policy iteration)
• We cover a minimum of content deeply, to prioritize having a big overview of RL and reaching deep RL
• And we always tell you where you are now, and what the limits are of the scope covered in each lecture and in the whole course
14. Textbook and Side Reader
• The most famous, popular
• Notations in this lecture follow this book
• Available for free
• A lot of practical examples
• Not necessarily recommended to read
everything in the order of this book
15. *Topics Not Covered by This Course
• Precise mathematical derivation
• Eligibility traces
• Details of RL with function approximation
• Partially observable Markov decision process
16. Table of Contents
• What is reinforcement learning (RL)?
• Purpose and Specificity of This Course
• How we should learn RL and how this course is structured
• Introducing RL with an example of balancing a bike
• Wrapping up
17. Tips for Making it Easier to Study RL in the Beginning
• In RL, ultimately you just want to optimize a policy, a probability of taking an action based only on where you are.
• In the beginning, you study dynamic programming (DP), a kind of RL without trial and error, and later you in a sense approximate DP with trial and error.
• Unlike typical supervised or unsupervised learning, two functions, a value and a policy, are interactively optimized in DP and RL.
• In the beginning, you work on disappointingly simple environments.
• Many of the algorithms introduced in most textbooks come from one fundamental idea.
18. First of All: A Slightly More Detailed Definition of RL
• Sequential decision making: optimizing a sequence of actions
• Policy: a function mapping a state to an action
• Markov decision process: the next state and reward depend only on the current state and action
• Expected return: an expectation of a sum of rewards over time steps
Reinforcement learning solves a sequential decision-making problem by optimizing a policy, mainly in a Markov decision process, such that the expected return is maximized.
Let's see what this means, keeping some important points in mind for now
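For reference, the expected return in the definition above is usually written as follows (standard notation consistent with the Barto and Sutton textbook this course follows; the discount factor gamma does not appear on the slide and is my addition):

```latex
% Return G_t: the (discounted) sum of rewards over time steps,
% with a discount factor 0 <= gamma <= 1.
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
% RL optimizes the policy pi so that the expected return is maximized:
\max_{\pi} \; \mathbb{E}_{\pi}\left[ G_t \right]
```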
19. Markov Decision Process and Environments
• Markov decision process: the next state and reward depend only on the current state and action
How you reached the state does not matter
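Stated more formally (a standard way to write the Markov property; this equation is my addition, not on the slide):

```latex
% Markov property: the next state and reward depend only on the
% current state and action, not on the earlier history.
\Pr(S_{t+1} = s', R_{t+1} = r \mid S_t, A_t)
  = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_0, A_0, S_1, A_1, \ldots, S_t, A_t)
```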
20. Sequential Decision Making by Optimizing a Policy
[Figure: bike-balancing states with arrows for the actions "no move", "lean left", "lean right"]
• Policy: a function mapping a state to an action (arrows in the figures)
• Sequential decision making: optimizing a sequence of actions
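As a minimal sketch of such a policy in code (my own illustration; the state names and probabilities are invented):

```python
import random

# A stochastic policy: for each state, a probability for each action.
policy = {
    "leaning left":  {"no move": 0.1, "lean left": 0.1, "lean right": 0.8},
    "upright":       {"no move": 0.8, "lean left": 0.1, "lean right": 0.1},
    "leaning right": {"no move": 0.1, "lean left": 0.8, "lean right": 0.1},
}

def act(state: str) -> str:
    """Sample an action based only on the current state."""
    probs = policy[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(act("leaning right"))  # most likely "lean left"
```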
21. Expected Return
• Expected return: an expectation of a sum of rewards over time steps
[Figure: two grid worlds. Left: policies are optimized so that an agent reaches the goal as soon as possible. Right: policies are optimized so that an agent doesn't approach penalty areas. Legend: positive-reward, negative-reward, and unmovable cells]
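As a small sketch of how the reward design above shapes the optimized policy (my own illustration; the cell names and numbers are invented):

```python
# Return of an episode: the sum of rewards over its time steps.
# With a step penalty, faster paths to the goal score higher returns,
# so the optimized policy reaches the goal as soon as possible.
def reward(cell: str, step_penalty: float = -0.1) -> float:
    table = {"goal": +1.0, "penalty": -1.0, "empty": 0.0}
    return table[cell] + step_penalty

def episode_return(cells: list[str]) -> float:
    return sum(reward(c) for c in cells)

print(episode_return(["empty", "empty", "goal"]))             # 0.7
print(episode_return(["empty", "penalty", "empty", "goal"]))  # -0.4
```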
22. Tips for Making it Easier to Study RL in the Beginning
• In RL, ultimately you just want to optimize a policy, a probability of taking an action based only on where you are.
• In the beginning, you study dynamic programming (DP), a kind of RL without trial and error, and later you in a sense approximate DP with trial and error.
• Unlike typical supervised or unsupervised learning, two functions, a value and a policy, are interactively optimized in DP and RL.
• In the beginning, you work on disappointingly simple environments.
• Many of the algorithms introduced in most textbooks come from one fundamental idea.
23. The Main Purpose of Lectures 2 and 3
Planning, dynamic programming ("RL" without trial and error)
[Figure: on one side, an agent planning by itself; on the other, the agent-environment loop with action and reward]
RL (planning with trial and error)
The processes of planning are approximated with "experiences" of the agent
25. Tips for Making it Easier to Study RL in the Beginning
• In RL, ultimately you just want to optimize a policy, a probability of taking an action based only on where you are.
• In the beginning, you study dynamic programming (DP), a kind of RL without trial and error, and later you in a sense approximate DP with trial and error.
• Unlike typical supervised or unsupervised learning, two functions, a value and a policy, are interactively optimized in DP and RL.
• In the beginning, you work on disappointingly simple environments.
• Many of the algorithms introduced in most textbooks come from one fundamental idea.
26. The Core Idea through This Lecture
[Figure: the agent-environment loop (action, reward); inside the agent, a value and a policy (the part that should be emphasized more)]
27. Optimization in RL Training: Interactive Updates of Value and Policy
• Supervised/unsupervised learning
[Figure: data and supervising data flow into an ML model, optimized in the direction where a loss function decreases]
• Reinforcement learning
[Figure: the agent-environment loop (action, reward); a policy (ML model) and a value (ML model) are updated by policy improvement and policy evaluation, in the direction where the expected reward increases, but along a zig-zag path]
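To make the zig-zag concrete, here is a runnable miniature of these interactive updates on a 5-state chain (my own sketch; the reward values and discount factor are illustrative assumptions, not course code):

```python
# A miniature of the interactive updates on the 5-state bike chain:
# evaluate the value under the current policy, then improve the policy
# greedily against that value, and alternate.
STATES = range(5)
ACTIONS = (-1, 0, +1)                    # lean left / no move / lean right
REWARD = {0: -1.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: -1.0}  # falling is penalized
GAMMA = 0.9

def step(s, a):
    return min(4, max(0, s + a))

V = {s: 0.0 for s in STATES}
policy = {s: 0 for s in STATES}          # start with "no move" everywhere

for _ in range(20):
    # Policy evaluation: move V toward what the current policy achieves
    for _ in range(20):
        V = {s: REWARD[step(s, policy[s])] + GAMMA * V[step(s, policy[s])]
             for s in STATES}
    # Policy improvement: act greedily with respect to the current V
    policy = {s: max(ACTIONS, key=lambda a, s=s: REWARD[step(s, a)] + GAMMA * V[step(s, a)])
              for s in STATES}

print(policy)  # edge states lean back toward the middle; the middle stays
```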
28. Value or Policy
[Figure, left ("policy + value"): in the agent-environment loop (action, reward), the agent holds a policy (ML model) and a state value (ML model). Right ("value"): the agent holds only an action value (ML model). The optimal value and policy can be derived from the action value function.]
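For instance, once an action value function Q is learned, the policy can be read off by acting greedily (a standard construction; the table entries below are invented for illustration):

```python
# Deriving a policy from an action value function Q(s, a):
# in each state, take the action with the highest value.
Q = {
    ("leaning left", "lean right"): 0.9,   # invented example values
    ("leaning left", "no move"):    0.2,
    ("leaning left", "lean left"):  -0.5,
}

def greedy_action(Q, state):
    candidates = {a: q for (s, a), q in Q.items() if s == state}
    return max(candidates, key=candidates.get)

print(greedy_action(Q, "leaning left"))  # "lean right"
```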
29. Tips for Making it Easier to Study RL in the Beginning
• In RL, ultimately you just want to optimize a policy, a probability of taking an action based only on where you are.
• In the beginning, you study dynamic programming (DP), a kind of RL without trial and error, and later you in a sense approximate DP with trial and error.
• Unlike typical supervised or unsupervised learning, two functions, a value and a policy, are interactively optimized in DP and RL.
• In the beginning, you work on disappointingly simple environments.
• Many of the algorithms introduced in most textbooks come from one fundamental idea.
30. Expressivity of Environments
RL with tabular data → RL with classical function approximation → deep RL (from low to high expressivity)
[Figure: three agent-environment loops (action, reward, next state); in each, the agent holds a value and a policy, represented with increasing expressivity, starting from tabular data]
31. Tips for Making it Easier to Study RL in the Beginning
• In RL, ultimately you just want to optimize a policy, a probability of taking an action based only on where you are.
• In the beginning, you study dynamic programming (DP), a kind of RL without trial and error, and later you in a sense approximate DP with trial and error.
• Unlike typical supervised or unsupervised learning, two functions, a value and a policy, are interactively optimized in DP and RL.
• In the beginning, you work on disappointingly simple environments.
• Many of the algorithms introduced in most textbooks come from one fundamental idea.
32. Generalized Policy Iteration (GPI)
[Diagram: a map of algorithms as special cases of GPI, organized along two axes, model-based vs. model-free and value-based vs. policy + value: policy iteration (Lecture 3), SARSA and Q-learning (Lecture 4), Dyna-Q (Lecture 7), the tabular actor-critic method with advantage function (Lecture 8), and AlphaZero]
34. Course Schedule
1. What is Reinforcement Learning (RL) and how should we learn it?
2. Dynamic programming (DP), expression of a Markov decision process
3. (Implementation exercise) DP with policy iteration, value iteration
4. TD learning: introducing "experience" and "trial and error"
5. TD or Monte Carlo, Exploration or Exploitation
6. (Implementation exercise) Model-free RL with the OpenAI Gym format
7. Model-based RL and searching
8. Understanding RL so far as combinations of strategy mode settings
9. RL with function approximation: Approximation of Value Function
10. RL with function approximation: Approximation of Policy
11. (Practical implementation) Stock market environment from yfinance
12. (Practical implementation) Deep Q Networks with Video Game
13. (Buffer)
14. (Buffer)
[Figure: the course progression. First, "RL" without trial and error (an agent holding a value and a policy); then "experiences" and "trial and error" are introduced (the full agent-environment loop with action and reward); then RL training is elaborated along the axes value vs. policy, model-free vs. model-based, exploitation vs. exploration; finally, advanced topics and implementations make environments and agents richer (from low to high expressivity)]
35. Table of Contents
• What is reinforcement learning (RL)?
• Purpose and Specificity of This Course
• How we should learn RL and how this course is structured
• Introducing RL with an example of balancing a bike
• Wrapping up
36. MDP with an Example of Balancing a Bike
[Figure: the bike-balancing MDP drawn as five states (state 0 through state 4) with three actions: leaning left, no move, leaning right]
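Written as data, such an MDP could look like the sketch below (my own illustration; the slide does not specify the transitions, so the deterministic rule here is an assumption):

```python
# The bike-balancing MDP: 5 states (0 = fallen left, 2 = upright,
# 4 = fallen right) and 3 actions, with deterministic transitions.
ACTIONS = {"leaning left": -1, "no move": 0, "leaning right": +1}

def step(state: int, action: str) -> int:
    """The next state depends only on the current state and action."""
    return min(4, max(0, state + ACTIONS[action]))

print(step(2, "leaning right"))  # 3
```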
37. Values and Policies: with an Example of Balancing a Bike
• Value: how good it is to be in a state
• Policy: a probability of taking an action in a state
[Figure: state 0: minus reward; state 1: low value; state 2: high value; state 3: low value; state 4: minus reward. Under the policy, action 0 gets low probability and action 2 gets high probability]
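In code, the figure might be summarized as below (the numbers, and the state the probabilities attach to, are my assumptions chosen only to match the figure qualitatively):

```python
# Value: how good it is to be in each state (illustrative numbers).
V = {0: -1.0, 1: 0.2, 2: 0.9, 3: 0.2, 4: -1.0}

# Policy: probabilities of taking each action in one particular state;
# the action toward the high-valued state 2 gets high probability.
policy_in_state_1 = {"leaning left": 0.05, "no move": 0.15, "leaning right": 0.8}
```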
38. Policy Updates
• Higher probability on actions toward higher-valued states
[Figure: state 0 (minus reward), state 1 (low value), state 2 (high value); from state 1, action 1 (leaning right), which leads toward the high-value state, is given higher probability than action 0 (leaning left)]
Then how can a value be learned?
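(Before turning to that question, here is a sketch of the policy update just described; my own illustration, reusing the value table and transitions from the sketches above:)

```python
# Policy update: give higher probability to the action whose successor
# state has the highest value (epsilon-greedy form for illustration).
ACTIONS = {"leaning left": -1, "no move": 0, "leaning right": +1}
V = {0: -1.0, 1: 0.2, 2: 0.9, 3: 0.2, 4: -1.0}

def step(state, action):
    return min(4, max(0, state + ACTIONS[action]))

def improved_policy(state, eps=0.1):
    best = max(ACTIONS, key=lambda a: V[step(state, a)])
    n = len(ACTIONS)
    return {a: 1 - eps + eps / n if a == best else eps / n for a in ACTIONS}

print(improved_policy(1))  # highest probability on "leaning right"
```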
39. Value Update: How to Learn from "Experiences"
Updating values by filling a gap between expectation and actual rewards
[Figure, two thought bubbles:
• "If I lean left, the value is low. As expected!" (TD loss is low)
• "Leaning right would not be good because the value is low... I was wrong, there is no bad reward. Let's update the value." (TD loss is high)]
Learning can happen without explicit rewards
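A minimal sketch of this value update in the standard TD(0) form (the step size alpha and discount gamma follow Barto and Sutton's notation; the numbers are invented):

```python
# TD(0) update: close the gap (TD error) between the value we expected
# in a state and what actually happened one step later.
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    td_error = r + gamma * V[s_next] - V[s]  # expectation vs. reality
    V[s] += alpha * td_error                 # move the value toward reality
    return td_error

V = {1: 0.2, 2: 0.9}
err = td_update(V, s=1, r=0.0, s_next=2)  # no bad reward, high next value
print(err, V[1])  # positive TD error: state 1's value is revised upward
```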
40. Interactive Updates of Value and Policy
• Value updates: closing gaps between expectations and real rewards
• Policy updates: taking actions toward higher values
[Figure: the five bike states (low-, medium-, high-, medium-, low-valued) with the two kinds of updates alternating]
41. Table of Contents
• What is reinforcement learning (RL)?
• Purpose and Specificity of This Course
• How we should learn RL and how this course is structured
• Introducing RL with an example of balancing a bike
• Wrapping up
43. Practice Problems
Q1: Explain the terms below without using any mathematical notation (you don't need to follow the mathematical definitions strictly)
Q2: Explain what reinforcement learning is, using the terms above
A. Policy
B. Markov decision process
C. Expected reward
D. Value
45. Some Representative Definitions of RL (2)
• Wikipedia: “Reinforcement learning (RL) is an interdisciplinary area of
machine learning and optimal control concerned with how an intelligent agent
ought to take actions in a dynamic environment in order to maximize the
cumulative reward.”
• Barto and Sutton’s book: “Reinforcement learning is learning what to do—
how to map situations to actions—so as to maximize a numerical reward
signal.”
47. Markov Decision Process (MDP) in Some Expressions
[Figure: the same agent-environment interaction (agent, env, action, reward) drawn in four ways]
• Typical RL diagram
• State transition diagram
• Backup diagram (closed)
• Graphical model
Editor's Notes
Reinforcement learning is already used in many ways
For example….
First of all, reinforcement learning is a family of methods to train machine learning models, so reinforcement learning is not really about how to design the structure of a model, like neural networks
Reinforcement learning stands on par with supervised learning and unsupervised learning; these methods are related to each other and cannot be clearly separated as they get more complicated
We are going to keep using the term “machine learning” instead of “AI” throughout this course
First let’s review machine learning mainly with an example of supervised learning.
Machine learning, in short, wants to numerically approximate the real-world black box from data
In supervised learning, the parameters in a model are updated basically with the disparity between its predictions and correct labels
With the advent of deep learning, expressivity of models advanced rapidly and now they can process highly complicated data like texts and images.
In the figure, a vector is a word, or a label. And a sequence of vectors is a sentence.
In the figure, a matrix means a 1 channel image. And a tensor means a multi channel (RGB) tensor.
But still, it is important to keep in mind that neural networks are just learning mappings between vectors and vectors, or tensors and tensors
Unsupervised learning, on the other hand, does not need correct labels
Its main intention is finding structure in data with heuristic, hand-crafted rules
Its evaluation often depends on whether humans can gain any insights or not
There are various ways of training neural networks, but still they are more or less inside the frameworks of supervised or unsupervised learning
Note: we might need to be careful about how to explain generative models
Whether you have supervised data or not, the main idea of machine learning is learning a function/mapping by adjusting parameters so that a certain loss function gets smaller.
In supervised learning the loss function mainly depends on the structure of the correct labels, and in unsupervised learning humans mainly need to design the formulas.
And whichever function it is, an initial function gets closer to the optimal function so that the loss function gets smaller, and this is usually done by gradient descent
We have seen that supervised learning mainly approximates certain rules with labels, and unsupervised learning finds structures in data with heuristic rules through training
Then what is reinforcement learning, in short?
I would say reinforcement learning is a training method to find the optimal action in a given state
And reinforcement learning is very unique compared to the other two training methods in some ways.
First of all, instead of datasets, in RL you need an environment, which is something like a video game
And as an objective function, RL uses the expected reward in the long run
And supervision in RL is a bit tricky; the supervision also comes indirectly, sometimes after several time steps
These are very distinctive points of RL, and we will learn them little by little in this lecture.
This course is going to be special in these points
We basically use this very famous textbook by Barto and Sutton
But I personally think that it is not the most efficient to read this book in the order of its table of contents
Due to these intentions, we will not cover these topics in RL in this course
Let me introduce
Please don’t panic, but let me introduce a definition of RL with a few terms
You are going to see what these words mean little by little through this course, but now please remember that you just optimize a policy
Also it is important to note that RL considers a simplified environment where your next state and reward depend only on where you are.
Even for complicated environments like video games, the environment is assumed to be an MDP
But by learning how to move in each state of such MDP processes, you can make long-term plans of actions
And such policies and resulting planning are optimized so that expected return is maximized.
Expected return is an expectation of rewards over several time steps.
And the design of how rewards are given of course affects the policy to be optimized
For example, if you get a penalty (minus reward) on every time step, an agent learns to reach the goal as soon as possible; otherwise, the agent learns to take safer paths to avoid the red blocks
An important point to note is that in the beginning of most RL curricula, you don’t use trial and error
Instead you learn dynamic programming (DP), which is RL without trial and error in a sense.
In DP, an agent perfectly knows how an environment works, or in other words the agent knows the model of the environment
But in RL, the agent does not know how the environment works
And to approximate the effects of DP without the model, you introduce trial and error to approximate the processes introduced in DP
But in fact, whether to have a model or not is not binary, and there is a gradation between the two states.
You can have a perfect model of an environment like in DP
And you can have no model and just memorize actions in each state.
But as an intermediate solution, you can estimate the model and do planning during trial and error
Let me call such axes, for example the one between model-based and model-free, “strategy mode settings”
One of the most important points in this talk is that we should always focus on how two functions, a value and a policy, are optimized in RL
And trial and error is actually important, but it can be seen just as one way of sampling data to give supervision for training the value and the policy
I visualized an abstract idea of how a model is trained in supervised or unsupervised learning: the model approaches the optimal function with supervision from a loss function
In RL, on the other hand, two functions interactively reach the optimal functions along a zig-zag path
Remember that we just want to optimize the policy in the end, and the value function is indirectly giving supervision to the policy
This idea leads to another strategy mode setting, between value-based and policy-based
As I said, we will learn that RL algorithms basically come from the idea of these interactive updates of a policy and a value (GPI)
But as a special case of GPI, you can update only a value function (action value function)
Or in other words, policy is updated as a part of an action value function
What makes most RL curricula confusing is that DP is first introduced to show the policy-based idea, but after that, in practice, mainly value-based methods are introduced for a while
That is simply because we need to introduce a lot of advanced ideas to totally separate a policy from the value
And in fact, until the 8th lecture we learn RL through disappointingly easy environments such as grid maps or state transition diagrams
But rather than introducing deep learning ideas from the very beginning, this is going to be more efficient after studying supervised learning and unsupervised learning
Just as in supervised or unsupervised learning, as long as you know the frameworks of training, which models to use is not a big problem.
So please be patient; we have prepared enough implementations to play around with
As you saw, I introduced some strategy mode settings
And also, by using the axes of value vs. policy or low vs. high expressivity as strategy settings, various practical algorithms can be classified like this
Based on the tips for learning RL, our lectures are structured like this.
When you simply balance a bike with 5 states and only 3 actions, the MDP can be visualized like this.
As I said earlier, you optimize two functions, value and policy
A value function gives the value of a state, that is, how good it is to be in that state
And policy is an action making rule only based on where you are
The policy is updated in the direction of higher values.
That means the policy is supervised by values, not always by an explicit reward
And the value function is updated based on “experiences”
You make a certain estimation of
The learning comes later, and this is said to be close to phenomena in neuroscience
I’m not a neuroscience expert, but I think you have experienced something similar in practicing sports, instruments, or something like that
When you have not mastered an action, during practice you get an intuition that you will fail or that you are getting closer to success. And after that you get the result.
After repeating this, after a break or a sleep, you somehow master the action.
RL is much more simplified, but this indirect supervision is a key of RL
RL iterates these processes.
Values are indirectly learned from ”experiences” and the values give supervision to the policy
As today’s assignment, please wrap up today’s contents by yourself.