Lecture 1:
What is Reinforcement Learning and How Should We Learn It?
Table of Contents
• What is reinforcement learning (RL)?
• Purpose and Specificity of This Course
• How we should learn RL and how this course is structured
• Introducing RL with an example of balancing a bike
• Wrapping up
2
What Can Reinforcement Learning Do?
Robotics
3
Gaming AI
Machine Self-Optimization
Financial Bots
Chat-Bot
Role of Reinforcement Learning (RL) in AI
(Diagram: within AI sits machine learning, organized along two axes: models, split into classical models and neural networks, and how to train models, split into supervised learning, unsupervised learning, and reinforcement learning.)
4
Review: Supervised Learning
Black-box model
• Approximation of a real-world black box
• Supervision by the disparity between predictions and labels (a loss function)
(Diagram: the model's correct and incorrect predictions are compared against the correct labels to produce the supervision signal.)
5
Deep Learning: Model as Neural Networks
• NLP: vectors → vectors (e.g., sentiment labels such as Positive / Negative)
• Image processing: a tensor → vectors/tensors (e.g., class labels such as Cat, Dog, Horse, Goat, …)
6
Unsupervised Learning
• Dimension reduction and clustering, each driven by handcrafted rules
7
Review: Various Training Examples for Deep Learning
• Classification: vectors → vectors, supervised by a correct label (e.g., Positive / Negative)
• Regression (e.g., translation): vectors → vectors, supervised by a correct translation
• ChatGPT: vectors → vectors, supervised by rankings given to the outputs (No. 1, No. 2, No. 3)
8
Review: Supervised or Unsupervised Training of ML Models
• Supervised/unsupervised learning framework
(Diagram: data and supervising data feed into the optimization of an ML model.)
• How the ML model is optimized: with gradient descent, in the direction where a loss function decreases, moving an initialized function toward the optimal function
9
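To make "optimization with gradient descent, in the direction where a loss function decreases" concrete, here is a minimal sketch of gradient descent on a toy regression problem. The data, learning rate, and model form are placeholders invented for illustration, not taken from the slides.

```python
import numpy as np

# Toy supervised data: inputs x and correct labels y (generated by y = 2x + 1).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

w, b = 0.0, 0.0            # the initialized function f(x) = w*x + b
learning_rate = 0.05       # assumed step size

for step in range(500):
    pred = w * x + b                      # predictions of the current function
    grad_w = np.mean(2 * (pred - y) * x)  # gradient of the mean squared loss w.r.t. w
    grad_b = np.mean(2 * (pred - y))      # gradient w.r.t. b
    w -= learning_rate * grad_w           # move in the direction where the loss decreases
    b -= learning_rate * grad_b

print(w, b)  # approaches the optimal function: w ≈ 2, b ≈ 1
```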
Basic Ideas of ML Types
• Supervised learning
(approximating functions)
• Unsupervised learning
(finding structures in data with heuristic rules)
• Reinforcement learning (finding the best action in each state)
(Figure: bike-balancing example with actions: no move, lean left, lean right.)
10
Differences between the Three Major Training Methods
Supervised learning
• Data: inputs and labels
• Objective: metrics such as accuracy
• Supervision: differences between predictions and labels
• Directness: direct supervision
• Timing: immediate supervision
Unsupervised learning
• Data: only inputs
• Objective: some insight for humans
• Supervision: a hand-crafted loss
• Directness: indirect supervision
• Timing: immediate supervision
Reinforcement learning
• Data: an environment
• Objective: expected return
• Supervision: differences between expectations and actual rewards
• Directness: indirect supervision
• Timing: delayed supervision, arriving after some steps
(Figure: bike-balancing example with actions: no move, lean left, lean right.)
11
Table of Contents
• What is reinforcement learning (RL)?
• Purpose and Specificity of This Course
• How we should learn RL and how this course is structured
• Introducing RL with an example of balancing a bike
• Wrapping up
12
The Purpose and Specificity of This Course
Reaching deep RL as efficiently as possible at an implementation level
13
• We organized the contents to emphasize that RL algorithms come from one core idea (GPI: generalized policy iteration)
• We cover a minimum of content in depth, to prioritize a big-picture overview of RL and reaching deep RL
• And we always tell you where you are now and what the limits of scope are in each lecture and in the whole course
Textbook and side reader
• The most famous and popular
• Notations in this lecture follow this book
• Available for free
• A lot of practical examples
• Not necessarily recommended to read everything in the order of this book
14
*Topics Not Covered by This Course
• Precise mathematical derivation
• Eligibility traces
• Details of RL with function approximation
• Partially observable Markov decision process
15
Table of Contents
• What is reinforcement learning (RL)?
• Purpose and Specificity of This Course
• How we should learn RL and how this course is structured
• Introducing RL with an example of balancing a bike
• Wrapping up
16
Tips for Making it Easier to Study RL in the Beginning
• In RL, ultimately you just want to optimize a policy: a probability of taking an action based only on where you are.
• In the beginning, you study dynamic programming (DP), a kind of RL without trial and error, and later you in a sense approximate DP with trial and error.
• Unlike typical supervised or unsupervised learning, two functions, a value and a policy, are optimized interactively in DP and RL.
• In the beginning, you work on disappointingly simple environments.
• Most of the algorithms introduced in textbooks come from one fundamental idea.
17
First of All: A Slightly Detailed Definition of RL
• Sequential decision making: optimizing a sequence of actions
• Policy: a function mapping a state to an action
• Markov decision process: the next state and reward depend only on the current state (and action)
• Expected return: an expectation of the sum of rewards over time steps
Reinforcement learning is a sequential decision-making problem in which a policy is optimized, usually in a Markov decision process, such that the expected return is maximized.
Let’s see what this means by keeping
some important points in mind for now
18
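As a minimal sketch of "you just optimize a policy, a probability of taking an action based only on where you are": a tabular policy can be stored as a mapping from each state to a probability distribution over actions. The state and action names below are hypothetical placeholders for the bike example, and the probabilities are invented.

```python
import random

# Tabular policy: for each state, a probability for each action (numbers are illustrative).
policy = {
    "tilted_left":  {"lean_left": 0.1, "no_move": 0.2, "lean_right": 0.7},
    "upright":      {"lean_left": 0.1, "no_move": 0.8, "lean_right": 0.1},
    "tilted_right": {"lean_left": 0.7, "no_move": 0.2, "lean_right": 0.1},
}

def sample_action(state):
    """Pick an action using only the current state; the history is irrelevant."""
    probs = policy[state]
    return random.choices(list(probs.keys()), weights=list(probs.values()))[0]

print(sample_action("tilted_left"))  # most often 'lean_right'
```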
Markov Decision Process and Environments
19
• Markov decision process: the next state and reward depend only on the current state (and action)
How you reached the state does not matter
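A sketch of what the Markov property means on the environment side: the possible next states and rewards are a function of only the current state and action, so how you reached the state does not matter. The transition table below is invented for illustration and only partially filled in.

```python
import random

# P[(state, action)] -> list of (probability, next_state, reward); numbers are assumptions.
P = {
    ("upright", "no_move"):     [(1.0, "upright", 0.0)],
    ("upright", "lean_left"):   [(1.0, "tilted_left", 0.0)],
    ("tilted_left", "no_move"): [(0.8, "tilted_left", 0.0), (0.2, "fallen", -1.0)],
    # ... remaining (state, action) pairs omitted
}

def step(state, action):
    """Next state and reward depend only on (state, action), never on the history."""
    outcomes = P[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=weights)[0]
    return next_state, reward
```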
Sequential Decision Making by Optimizing a Policy
• Policy: a function mapping a state to an action (the arrows in the figures)
• Sequential decision making: optimizing a sequence of actions
(Figure: bike-balancing states with actions: no move, lean left, lean right.)
20
Expected Return
• Expected return: an expectation of the sum of rewards over time steps (a code sketch follows below)
• Policies are optimized so that the agent reaches the goal as soon as possible, or so that it doesn't approach penalty areas, depending on how the rewards are designed
(Grid legend: positive-reward cells, negative-reward cells, unmovable cells. Note: the cells have to be defined/explained.)
21
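A small sketch of the expected return: the return of one episode is the sum of rewards over time steps (here with a discount factor gamma, a common addition not yet introduced on this slide), and the expected return averages this over many episodes. The reward values are placeholders.

```python
def episode_return(rewards, gamma=0.9):
    """Discounted sum of rewards over time steps."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# One episode: a small penalty per step, then a positive reward at the goal,
# so shorter paths yield a larger return and the agent hurries to the goal.
print(episode_return([-0.1, -0.1, -0.1, 1.0]))

# The expected return is (approximately) the average over many sampled episodes.
episodes = [[-0.1, -0.1, 1.0], [-0.1, -0.1, -0.1, -0.1, 1.0]]
print(sum(episode_return(ep) for ep in episodes) / len(episodes))
```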
Tips for Making it Easier to Study RL in the Beginning
• In RL, ultimately you just want to optimize a policy: a probability of taking an action based only on where you are.
• In the beginning, you study dynamic programming (DP), a kind of RL without trial and error, and later you in a sense approximate DP with trial and error.
• Unlike typical supervised or unsupervised learning, two functions, a value and a policy, are optimized interactively in DP and RL.
• In the beginning, you work on disappointingly simple environments.
• Most of the algorithms introduced in textbooks come from one fundamental idea.
22
The Main Purpose of Lectures 2 and 3
 Planning / dynamic programming: "RL" without trial and error
 RL: planning with trial and error
(Diagram: agent-environment loops with actions and rewards for each case.)
The processes of planning are approximated with "experiences" of the agent
23
Model-based or model-free
(Diagram: a spectrum from model-based to model-free; planning sits at the model-based end, model-based RL in between, and model-free RL at the model-free end.)
24
Tips for Making it Easier to Study RL in the Beginning
• In RL, ultimately you just want to optimize a policy: a probability of taking an action based only on where you are.
• In the beginning, you study dynamic programming (DP), a kind of RL without trial and error, and later you in a sense approximate DP with trial and error.
• Unlike typical supervised or unsupervised learning, two functions, a value and a policy, are optimized interactively in DP and RL.
• In the beginning, you work on disappointingly simple environments.
• Most of the algorithms introduced in textbooks come from one fundamental idea.
25
The Core Idea through This Lecture
(Diagram: agent-environment loop with actions and rewards; inside the agent, a value and a policy. Note: this part should be emphasized more.)
26
Optimization in RL Training: Interactive Updates of Value and Policy
• Supervised/unsupervised learning: data and supervising data drive the optimization of one ML model, in the direction where a loss function decreases
• Reinforcement learning: the agent holds two ML models, a policy and a value; policy evaluation and policy improvement update them interactively, in the direction where the expected reward increases, but along a zig-zag path (a minimal code sketch follows below)
27
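A minimal sketch of the two interleaved updates named above, policy evaluation and policy improvement, on a tiny tabular problem. The transition table, rewards, and number of iterations are invented for illustration; the proper algorithm (policy iteration) is treated in lectures 2 and 3.

```python
# Tiny deterministic MDP: state -> {action: (next_state, reward)} (values are assumptions).
P = {
    "s0": {"left": ("s0", 0.0), "right": ("s1", 0.0)},
    "s1": {"left": ("s0", 0.0), "right": ("s2", 1.0)},
    "s2": {"left": ("s2", 0.0), "right": ("s2", 0.0)},
}
gamma = 0.9
policy = {s: "left" for s in P}   # an arbitrary initial policy
V = {s: 0.0 for s in P}           # an initialized value function

for _ in range(50):
    # Policy evaluation: move V toward the value of the current policy.
    for s in P:
        next_state, reward = P[s][policy[s]]
        V[s] = reward + gamma * V[next_state]
    # Policy improvement: act greedily with respect to the current V.
    for s in P:
        policy[s] = max(P[s], key=lambda a: P[s][a][1] + gamma * V[P[s][a][0]])

print(policy)  # zig-zags toward the optimal policy: 'right' in s0 and s1
```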
Value or Policy
• Policy + value: the agent holds two ML models, a policy and a state value
• Value only: the agent holds one ML model, an action value, from which the optimal value and policy can be derived (a code sketch follows below)
(Diagrams: agent-environment loops with actions and rewards for each case.)
28
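A sketch of the note that the optimal value and policy can be derived from the action value: given an action-value table Q(state, action), a greedy policy simply picks the highest-valued action in each state, and the state value is the maximum over actions. The Q numbers are invented.

```python
# Action-value table Q[state][action]; the numbers are illustrative only.
Q = {
    "tilted_left":  {"lean_left": -0.5, "no_move": 0.1, "lean_right": 0.8},
    "upright":      {"lean_left": 0.2,  "no_move": 0.9, "lean_right": 0.2},
    "tilted_right": {"lean_left": 0.8,  "no_move": 0.1, "lean_right": -0.5},
}

# Greedy policy: in each state, take the action with the highest action value.
policy = {s: max(acts, key=acts.get) for s, acts in Q.items()}

# State value: the value of the best action in each state.
V = {s: max(acts.values()) for s, acts in Q.items()}

print(policy["tilted_left"], V["tilted_left"])  # 'lean_right', 0.8
```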
Tips for Making it Easier to Study RL in the Beginning
• In RL, ultimately you just want to optimize a policy: a probability of taking an action based only on where you are.
• In the beginning, you study dynamic programming (DP), a kind of RL without trial and error, and later you in a sense approximate DP with trial and error.
• Unlike typical supervised or unsupervised learning, two functions, a value and a policy, are optimized interactively in DP and RL.
• In the beginning, you work on disappointingly simple environments.
• Most of the algorithms introduced in textbooks come from one fundamental idea.
29
Expressivity of Environments
Low expressivity → high expressivity: RL with tabular data → RL with classical function approximation → deep RL
(Diagrams: three agent-environment loops (action, reward, next state) in which the value and the policy are represented as tabular data, as classical function approximators, or as neural networks.)
(Note: the environments need an explanation.)
30
Tips for Making it Easier to Study RL in the Beginning
• In RL, ultimately you just want to optimize a policy: a probability of taking an action based only on where you are.
• In the beginning, you study dynamic programming (DP), a kind of RL without trial and error, and later you in a sense approximate DP with trial and error.
• Unlike typical supervised or unsupervised learning, two functions, a value and a policy, are optimized interactively in DP and RL.
• In the beginning, you work on disappointingly simple environments.
• Most of the algorithms introduced in textbooks come from one fundamental idea.
31
Generalized Policy Iteration (GPI)
(Chart: the algorithms below are organized along two axes (model-based vs. model-free, and policy + value vs. value-based) and from low to high expressivity, all as instances of generalized policy iteration (GPI).)
• Policy iteration (Lecture 3)
• SARSA, Q-learning (Lecture 4)
• Dyna-Q (Lecture 7)
• Tabular actor-critic method with an advantage function (Lecture 8)
• AlphaZero
Course Schedule
1. What is reinforcement learning (RL) and how should we learn it?
2. Dynamic programming (DP), expression of a Markov decision process
3. (Implementation exercise) DP with policy iteration and value iteration
4. TD learning: introducing "experience" and "trial and error"
5. TD or Monte Carlo, exploration or exploitation
6. (Implementation exercise) Model-free RL with the OpenAI Gym format
7. Model-based RL and searching
8. Understanding RL so far as combinations of strategy mode settings
9. RL with function approximation: approximation of the value function
10. RL with function approximation: approximation of the policy
11. (Practical implementation) Stock market environment from yfinance
12. (Practical implementation) Deep Q-networks with a video game
13. (Buffer)
14. (Buffer)
(Annotations on the schedule: "RL" without trial and error; introducing "experiences" and "trial and error" into RL; elaborating RL training along the axes value vs. policy, model-free vs. model-based, exploitation vs. exploration; advanced topics and implementations; making environments and agents richer, from low to high expressivity.)
34
Table of Contents
• What is reinforcement learning (RL)?
• Purpose and Specificity of This Course
• How we should learn RL and how this course is structured
• Introducing RL with an example of balancing a bike
• Wrapping up
35
MDP with an Example of Balancing a Bike
(Figure: the five states of the bike, State 0 through State 4, and the three actions: leaning left, no move, leaning right.)
36
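A sketch of how the five states and three actions of this slide could be written down as a tiny environment. The dynamics (each action shifts the lean by one state) and the -1 reward for falling over are assumptions made for illustration.

```python
# States 0..4: states 0 and 4 mean the bike has fallen (minus reward); state 2 is upright.
STATES = [0, 1, 2, 3, 4]
ACTIONS = {-1: "leaning left", 0: "no move", +1: "leaning right"}

def step(state, action):
    """Assumed dynamics: the chosen action shifts the state by -1, 0, or +1."""
    next_state = min(max(state + action, 0), 4)
    reward = -1.0 if next_state in (0, 4) else 0.0   # falling over is penalized
    done = next_state in (0, 4)
    return next_state, reward, done

print(step(2, +1))  # (3, 0.0, False)
```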
Values and Policies: with an Example of Balancing a Bike
• Value: how good it is to be in a state
• Policy: a probability of taking an action in a state
(Figure: State 0 and State 4 have a minus reward; State 1 and State 3 have low value; State 2 has high value. Action 0: low probability; Action 1; Action 2: high probability.)
37
Policy updates
• Give higher probability to actions in the direction of higher values (see the sketch after this slide)
(Figure: State 0 has a minus reward, State 1 a low value, State 2 a high value; Action 0 is leaning left, Action 1 is leaning right, and the action toward the higher value is given the higher probability.)
Then how can a value be learned?
38
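A sketch of the policy update described above: raise the probability of the action pointing toward the higher-valued state and lower the others, keeping the probabilities normalized. The fixed step size is a simplification invented here, not the exact update rule from later lectures.

```python
# Values of the neighbouring states (illustrative): state 2 is the higher-valued one.
V = {"state_0": -1.0, "state_2": 0.9}

# Current policy in state 1: probabilities of the two actions.
policy_s1 = {"action_0_lean_left": 0.5, "action_1_lean_right": 0.5}

# Action 1 (leaning right) leads toward the higher-valued state, so raise its probability.
step_size = 0.1
best_action = "action_1_lean_right"
for a in policy_s1:
    policy_s1[a] += step_size if a == best_action else -step_size

# Renormalize so the probabilities stay non-negative and sum to one.
total = sum(max(p, 0.0) for p in policy_s1.values())
policy_s1 = {a: max(p, 0.0) / total for a, p in policy_s1.items()}
print(policy_s1)  # probability mass shifts toward leaning right
```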
Value Update: How to Learn from "Experiences"
 Updating values by closing the gap between expectations and actual rewards (a code sketch follows below)
(Figure, left: "If I lean left, the value is low. As expected!"; the TD loss is low.)
(Figure, right: "Leaning right would not be good, because the value is low." ... "I was wrong: there is no bad reward. Let's update the value."; the TD loss is high.)
Learning can happen without explicit rewards
39
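A sketch of the value update described above, in the style of a TD update: the value of the previous state is nudged toward the observed reward plus the value of the state actually reached, and the size of that gap (the TD error) is what drives learning. The learning rate, discount factor, and values are placeholders; the precise algorithm is introduced in lecture 4.

```python
V = {"tilted_right": -0.3, "upright": 0.5}   # current value estimates (illustrative)
alpha, gamma = 0.1, 0.9                      # assumed learning rate and discount factor

def td_update(state, reward, next_state):
    """Close the gap between the expectation V[state] and what actually happened."""
    td_error = reward + gamma * V[next_state] - V[state]   # expectation vs. reality
    V[state] += alpha * td_error
    return td_error

# "Leaning right would not be good" ... but no bad reward arrived and the next state
# looks fine, so the TD error is large and the value is revised upward:
# learning happens without any explicit reward.
print(td_update("tilted_right", reward=0.0, next_state="upright"), V["tilted_right"])
```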
Interactive Updates of Value and Policy
• Value updates: closing the gaps between expectations and real rewards
• Policy updates: taking actions toward higher-valued states
(Figure: a row of states (low-valued, medium-valued, high-valued, medium-valued, low-valued) over which the two updates alternate.)
40
Table of Contents
• What is reinforcement learning (RL)?
• Purpose and Specificity of This Course
• How we should learn RL and how this course is structured
• Introducing RL with an example of balancing a bike
• Wrapping up
41
Wrapping Up
This time, please wrap things up by yourself
42
Practice problems
 Q1: Explain the terms below without using any mathematical notation (you don't need to follow the mathematical definitions strictly)
 Q2: Explain what reinforcement learning is, using the terms above
A. Policy
B. Markov decision process
C. Expected return
D. Value
43
Supplementary slides
44
Some Representative Definitions of RL (1)
• Wikipedia: “Reinforcement learning (RL) is an interdisciplinary area of
machine learning and optimal control concerned with how an intelligent agent
ought to take actions in a dynamic environment in order to maximize the
cumulative reward.”
• Barto and Sutton’s book: “Reinforcement learning is learning what to do—
how to map situations to actions—so as to maximize a numerical reward
signal.”
45
Some Representative Definitions of RL (2)
• ChatGPT 3.5 (16.11.2023)
46
Markov Decision Process (MDP) in Some Expressions
• Typical RL diagram (agent-environment loop with actions and rewards)
• State transition diagram
• Backup diagram (closed)
• Graphical model
47
Editor's Notes
1. Reinforcement learning is already used in many ways. For example…
2. First of all, reinforcement learning is a family of methods for training machine learning models, so reinforcement learning is not really about how to design the structure of a model, such as a neural network. Reinforcement learning stands on par with supervised learning and unsupervised learning; these methods are related to each other and cannot be clearly separated as they get more complicated. We are going to keep using the term "machine learning" instead of "AI" throughout this course.
3. First, let's review machine learning, mainly with the example of supervised learning. Machine learning, in short, numerically approximates a real-world black box from data. In supervised learning, the parameters of a model are updated basically from the disparity between its predictions and the correct labels.
4. With the advent of deep learning, the expressivity of models advanced rapidly, and they can now process highly complicated data such as texts and images. In the figure, a vector is a word or a label, and a sequence of vectors is a sentence; a matrix means a one-channel image, and a tensor means a multi-channel (RGB) image. Still, it is important to keep in mind that neural networks are just learning mappings from vectors to vectors or from tensors to tensors.
5. Unsupervised learning, on the other hand, does not need correct labels. Its main intention is to find structure in data with heuristic, handcrafted rules. Its evaluation often depends on whether humans can gain any insight or not.
6. There are various ways of training neural networks, but they still fall more or less within the frameworks of supervised or unsupervised learning. (Note: we might need to be careful about how to explain generative models.)
7. Whether you have supervised data or not, the main idea of machine learning is learning a function (a mapping) by adjusting parameters so that a certain loss function gets smaller. In supervised learning the loss function mainly depends on the structure of the correct labels, and in unsupervised learning humans mainly need to design the formulas. In either case, an initial function gets closer to the optimal function as the loss function gets smaller, and this is usually done by gradient descent.
8. We have seen that supervised learning mainly approximates certain rules from labels, and unsupervised learning finds structure in data with heuristic rules through training. Then what is reinforcement learning, in short? I would say reinforcement learning is a training method for finding the optimal action in a given state.
9. Reinforcement learning is quite unique compared with the other two training methods in several ways. First of all, instead of a dataset, in RL you need an environment, which is something like a video game. As an objective function, RL uses the expected reward in the long run. Supervision in RL is a bit tricky: it comes indirectly, sometimes only after several time steps. These are the distinctive points of RL, and we will learn them little by little in this lecture.
10. This course is going to be special in these respects.
11. We basically use the very famous textbook by Barto and Sutton, but I personally think it is not most efficient to read this book in the order of its table of contents.
12. Because of these intentions, we will not cover these RL topics in this course.
  13. Let me introduce
14. Please don't panic, but let me introduce a definition of RL with a few terms. You will see what these words mean little by little through this course; for now, please just remember that you optimize a policy.
15. It is also important to note that RL considers a simplified environment in which the next state and reward depend only on where you are. Even for complicated environments such as video games, the environment is assumed to be an MDP.
16. But by learning how to move in each state of such an MDP, you can make long-term plans of actions.
17. Such policies, and the resulting plans, are optimized so that the expected return is maximized. The expected return is an expectation of the rewards over several time steps. The design of how rewards are given of course affects the policy that is learned. For example, if you get a penalty (a minus reward) at every time step, the agent learns to reach the goal as soon as possible; otherwise, the agent learns to take safer paths that avoid the red blocks.
18. An important point is that at the beginning of most RL curricula you don't use trial and error. Instead you learn dynamic programming (DP), which is, in a sense, RL without trial and error. In DP, the agent knows perfectly how the environment works, i.e. the agent knows the model of the environment. In RL, the agent does not know how the environment works, and to approximate the effects of DP without the model, you introduce trial and error to approximate the processes introduced in DP.
19. In fact, whether to have a model or not is not binary; there is a gradation between the two. You can have a perfect model of the environment, as in DP, or you can have no model and just memorize actions in each state. As an intermediate solution, you can estimate the model and plan during trial and error. Let me call choices like this, between model-based and model-free, "strategy mode settings".
20. One of the most important points in this talk is that we should always focus on how two functions, a value and a policy, are optimized in RL. Trial and error is certainly important, but it can be seen as just one way of sampling data to supervise the training of the value and the policy.
21. I visualized the abstract idea of how a model is trained in supervised or unsupervised learning: the model approaches the optimal function with supervision from a loss function. In RL, on the other hand, two functions interactively approach the optimal functions along a zig-zag path. Remember that in the end we just want to optimize the policy, and the value function indirectly supervises the policy.
22. This idea leads to another strategy mode setting, between value-based and policy-based. As I said, we will learn that RL algorithms basically come from the idea of these interactive updates of a policy and a value (GPI). But as a special case of GPI, you can update only a value function (an action value function); in other words, the policy is updated as part of the action value function. What makes most RL curricula confusing is that DP is first introduced to show the policy-based idea, but after that, in practice, mainly value-based methods are introduced for a while. That is simply because a lot of advanced ideas are needed to totally separate the policy from the value.
23. In fact, until the 8th lecture we learn RL through disappointingly easy environments such as grid maps or state transition diagrams. Rather than introducing deep learning ideas from the very beginning, this is more efficient after studying supervised and unsupervised learning. Just as in supervised or unsupervised learning, as long as you know the frameworks of training, which models to use is not a big problem. So please be patient; we have prepared enough implementations to play around with.
24. As you saw, I introduced some strategy mode settings.
25. By using the axes of value vs. policy and low vs. high expressivity, various practical algorithms can be classified like this.
  26. Based on the tips for learning RL, our lectures are constructed like this.
27. When you simplify balancing a bike to 5 states and only 3 actions, the MDP can be visualized like this.
28. As I said earlier, you optimize two functions, a value and a policy. A value function gives the value of a state, how good it is to be in that state. A policy is an action-making rule based only on where you are.
29. The policy is updated in the direction of higher values. That means the policy is supervised by values, not always by an explicit reward.
30. The value function is updated based on "experiences". You make a certain estimation, and the learning comes later; this is said to be close to phenomena in neuroscience. I'm not a neuroscience expert, but I think you have experienced something similar in practicing sports, instruments, or something else: when you have not yet mastered an action, during practice you get an intuition that you will fail or get closer to success, and after that you get the result. After repeating this, and after a break or a sleep, you somehow master the action. RL is much more simplified, but this indirect supervision is a key of RL.
31. RL iterates these processes. Values are indirectly learned from "experiences", and the values give supervision to the policy.
32. Today, as an assignment, please wrap up the contents by yourself.