I had to explain what reinforcement learning (RL) is to my colleagues in 10 minutes.
Given the time limit and the audience, I tried to explain the points with as little mathematical notation as possible.
Here are the slides I used, and you are welcome to use them.
I’d also appreciate some feedback.
The theme of the talk is simple: you should stop saying “trial and error” at the beginning of studying RL.
In my opinion, the more important point is that a value and a policy are updated interactively.
Without this point, you would just get lost in the typical RL curriculum, which starts with dynamic programming and Q-learning.
The former is an RL method without trial and error, and the latter is a case where a value and a policy are combined.
Other topics such as Monte Carlo methods, TD learning, function approximation, and exploration are also crucial, but they are options for making RL more diverse.
Anyway, I’m also still in the process of studying this.
I’d appreciate feedback on my “study notes.”
https://data-science-blog.com/blog/2021/07/31/my-elaborate-study-notes-on-reinforcement-learning/
2. Table of Contents
• Theme of This Tech Talk: Stop Saying “Trial and Error”
• Rough Definition of RL (*basic settings)
• Planning in Markov Decision Process (MDP)
• Interactive Optimization of Policies and Values
• Wrapping Up
3. Theme of This Tech Talk: Stop Saying “Trial and Error”
With these charts alone, you will miss the point at the beginning.
4. From “Trial and Error” to Interactive Value-Policy Updates
[Diagram: the typical agent-environment loop (agent, environment, action, reward), extended with a value and a policy; the value-policy part is what should be emphasized more.]
5. Table of Contents
• Theme of This Tech Talk: Stop Saying “Trial and Error”
• Rough Definition of RL (*basic settings)
• Planning in Markov Decision Process (MDP)
• Interactive Optimization of Policies and Values
• Wrapping Up
6. Role of Reinforcement Learning (RL) in AI
[Diagram: where RL sits in AI. Machine learning is part of AI; models (classical models, neural networks) are distinguished from how to train them (supervised learning, unsupervised learning, reinforcement learning).]
7. Rough Definition of RL: Planning Problem
• Sequential decision making: optimizing a sequence of actions
• Optimizing a “policy”: a “policy” means how to move in a given “state”
• Assuming a Markov decision process: the next action depends only on the current state
[Diagram: a policy maps a state to an action. Example of planning: navigating a robot.]
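As a rough sketch of this setting (the `env` and `policy` names here are hypothetical, just for illustration, not from any specific library), a sequential decision problem can be pictured as a simple loop:

```python
# A minimal sketch of sequential decision making: a policy maps the
# current state to an action, and the environment returns the next
# state and a reward. `env` and `policy` are hypothetical objects.
def rollout(env, policy, steps=10):
    state = env.reset()
    total_reward = 0.0
    for _ in range(steps):
        # Markov assumption: the action depends only on the current state
        action = policy(state)
        state, reward = env.step(action)
        total_reward += reward
    return total_reward
```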
8. Table of Contents
• Theme of This Tech Talk: Stop Saying “Trial and Error”
• Rough Definition of RL (*basic settings)
• Planning in Markov Decision Process (MDP)
• Interactive Optimization of Policies and Values
• Wrapping Up
9. Markov Decision Process (MDP) in Some Expressions
[Diagram: the agent-environment loop with action and reward, drawn in four ways]
• Typical RL diagram
• State transition diagram
• Backup diagram (closed)
• Graphical model
10. MDP: with an Example of Balancing a Bike
[Diagram: the bike MDP with five states (State 0 to State 4) and three actions: leaning left, no move, leaning right.]
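To make the example concrete, here is one way the five-state bike MDP could be written down as plain tables. The exact transition rule (leaning left moves one state to the left, leaning right one to the right) is my assumption for illustration:

```python
# A sketch of the five-state bike MDP from the slide. The transition
# rule is an assumption: leaning shifts the bike one state per step.
STATES = [0, 1, 2, 3, 4]                    # 0 and 4: fallen, 2: upright
ACTIONS = ["lean_left", "no_move", "lean_right"]

def transition(state, action):
    if action == "lean_left":
        return max(state - 1, 0)
    if action == "lean_right":
        return min(state + 1, 4)
    return state                             # "no_move" stays put

def reward(state):
    # Negative reward only when the bike has fallen over
    return -1.0 if state in (0, 4) else 0.0
```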
11. Planning in MDP: Some Expressions
• Learning how to move optimally in each state
[Diagram: the optimal action in each state: no move, lean left, or lean right.]
12. Table of Contents
• Theme of This Tech Talk: Stop Saying “Trial and Error”
• Rough Definition of RL (*basic settings)
• Planning in Markov Decision Process (MDP)
• Interactive Optimization of Policies and Values
• Wrapping Up
13. Values and Policies: with an Example of Balancing a Bike
• Value: how good it is to be in a state
• Policy: the probability of taking an action in a state
[Diagram: State 0: negative reward; State 1: low value; State 2: high value; State 3: low value; State 4: negative reward. Action 0: low probability; Action 1; Action 2: high probability.]
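In code, these two objects are just lookup tables. The concrete numbers below are made up to match the picture (high value for the upright state, negative reward at the fallen states):

```python
# Value: one number per state, "how good it is to be there".
# The numbers are illustrative, not computed.
V = {0: -1.0, 1: 0.2, 2: 1.0, 3: 0.2, 4: -1.0}

# Policy: per state, a probability for each action. In state 1,
# leaning right (toward the high-value state 2) gets high probability.
policy = {
    1: {"lean_left": 0.1, "no_move": 0.2, "lean_right": 0.7},
}
```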
14. Policy Updates
• Give higher probability to actions in the direction of high-value states
[Diagram: State 0: negative reward; State 1: low value; State 2: high value. Action 0: leaning left; Action 1: leaning right, which is given higher probability.]
Then how can a value be learned?
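One way to implement this update (my choice for illustration; the slide does not commit to a specific rule) is a softmax over the values of the states each action leads to, reusing `V` and `transition` from the sketches above:

```python
import math

# Give higher probability to actions that lead to higher-value states,
# via a softmax over next-state values. `temperature` is a tuning knob.
def improve_policy(state, V, actions, transition, temperature=1.0):
    next_values = [V[transition(state, a)] for a in actions]
    exps = [math.exp(v / temperature) for v in next_values]
    total = sum(exps)
    return {a: e / total for a, e in zip(actions, exps)}
```

For example, in state 1 the action "lean_right" leads to the high-value state 2, so it receives the largest probability.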
15. Value Update: Temporal Difference (TD) Learning
• TD learning: updating values by filling the gap between expectations and actual rewards
[Diagram, two cases. Case 1: “If you lean left, the value is low. As expected!” (the TD loss is low). Case 2: “Leaning right would not be good because the value is low. … I was wrong. There is no bad reward. Let’s update the value.” (the TD loss is high).]
Learning can happen even without explicit rewards.
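The update behind this picture is the standard TD(0) rule: compare the current estimate with the bootstrapped target and move toward it. A minimal sketch (step size `alpha` and discount `gamma` are tuning parameters I picked for illustration):

```python
# TD(0): V[s] is nudged toward the target r + gamma * V[s_next].
# The difference is the "gap" the slide talks about.
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error
```

Note that `V[s]` can change even when `r == 0`, because the target also contains `V[s_next]`; this is the sense in which learning happens without explicit rewards.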
17. Table of Contents
• Theme of This Tech Talk: Stop Saying “Trial and Error”
• Rough Definition of RL (*basic settings)
• Planning in Markov Decision Process (MDP)
• Interactive Optimization of Policies and Values
• Wrapping Up
18. Wrapping Up
• RL formulation: a planning problem solved by optimizing a policy
• Simple assumption of an MDP: an action depends only on the current state
• Importance of a value: updating a policy by evaluating how good it is to be in each state
• TD learning: updating values by filling the gap between estimated values and actual rewards
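Putting the pieces together, here is a toy training loop for the bike example (my own sketch, reusing the helpers above; an illustration of the theme, not the exact algorithm from the talk): the value is updated by TD(0) from experience, and the policy is re-derived from the updated values, so the two are optimized interactively:

```python
import random

# Interactive value-policy updates on the toy bike MDP.
def train(episodes=100):
    V = {s: 0.0 for s in STATES}
    for _ in range(episodes):
        s = 2                                   # start upright
        for _ in range(20):
            probs = improve_policy(s, V, ACTIONS, transition)  # policy from values
            a = random.choices(ACTIONS, weights=[probs[x] for x in ACTIONS])[0]
            s_next = transition(s, a)
            r = reward(s_next)
            td_update(V, s, r, s_next)          # values from experience
            if s_next in (0, 4):                # fallen over: end the episode
                break
            s = s_next
    return V
```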