2. 2
Introduction
What is Reinforcement Learning (RL)?
Problems that RL focuses on
Control problem
Multi-armed bandit
Combinatorial optimization
Cooperative behavior learning
Competitive behavior learning
Mixed behavior learning
Learning from human experts
Learning from human feedback
Contents
4. 4
Definition and objective of RL
A type of machine learning in which an agent learns in an interactive environment by trial
and error, using feedback from its own actions and experience
The agent aims to maximize the expected return (the sum of rewards)
• $\pi^{*} = \arg\max_{\pi} \, \mathbb{E}_{\tau \sim \rho_{\pi}(\tau)} \left[ \sum_{t=0}^{\infty} r_t(s_t, a_t) \,\middle|\, \pi \right]$
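The expectation in the objective can be approximated by sampling trajectories and averaging their returns. The following is a minimal sketch under stated assumptions: the infinite sum is truncated to finite episodes, and `coin_flip_episode` is a hypothetical toy task invented for illustration, not something from the slides.

```python
import random

def episode_return(rewards):
    """Return of one trajectory: the plain (undiscounted) sum of its rewards."""
    return sum(rewards)

def estimate_expected_return(sample_rewards, num_episodes=1000, seed=0):
    """Monte Carlo estimate of the expected return under a fixed policy.

    `sample_rewards(rng)` is assumed to roll out one episode and
    return its list of per-step rewards.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_episodes):
        total += episode_return(sample_rewards(rng))
    return total / num_episodes

def coin_flip_episode(rng, horizon=10):
    """Hypothetical toy task: reward 1 with probability 0.5 at each of 10 steps."""
    return [1.0 if rng.random() < 0.5 else 0.0 for _ in range(horizon)]

# The true expected return of the toy task is 10 * 0.5 = 5.
estimate = estimate_expected_return(coin_flip_episode, num_episodes=2000)
```

Finding the optimal policy then amounts to searching for the policy whose estimated return is largest.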
What is reinforcement learning (RL)
Components in RL
Agent: The learner and decision-maker in RL
Environment: Everything outside the agent, which the agent
interacts with
Step: A single atomic interaction with the environment
Episode: A sequence of steps that ends when the system
reaches a terminal state
Data flow of the reinforcement learning
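The data flow between agent and environment can be sketched as a simple loop: the agent sends an action, and the environment returns the next state and a reward until a terminal state ends the episode. `CorridorEnv` and its `reset`/`step` methods are hypothetical names chosen for this illustration, and the environment itself is a made-up toy.

```python
class CorridorEnv:
    """Hypothetical toy environment: start at cell 0 and walk to cell `goal`."""

    def __init__(self, goal=3):
        self.goal = goal
        self.position = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply one action (0 = left, 1 = right); return (state, reward, done)."""
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)  # the wall at cell 0 blocks moves to the left
        done = self.position == self.goal             # terminal state reached
        reward = 1.0 if done else 0.0
        return self.position, reward, done

def run_episode(env, policy, max_steps=100):
    """Run one episode and return (sum of rewards, number of steps taken)."""
    state = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(state)                   # agent -> environment
        state, reward, done = env.step(action)   # environment -> agent
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps

total, steps = run_episode(CorridorEnv(goal=3), policy=lambda state: 1)
```

Here the policy always moves right, so the episode ends after three steps with a return of 1.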
5. 5
Components in RL
Action $a_t$: The move the agent makes at time step $t$, chosen from all the possible moves available to it
State $s_t$: The current situation returned by the environment
Reward $r_t$: An immediate signal sent back from the environment to evaluate the last action
Policy $\pi_\theta$: The strategy that the agent employs to determine the next action based on the current state
• The policy $\pi_\theta$, parameterized by $\theta$, is a mapping from the state space $\mathbb{S}$ to the action space $\mathbb{A}$
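A parameterized policy can be illustrated with a small table of preferences, one per state-action pair. This sketch assumes discrete states and actions and a stochastic softmax policy (a distribution over actions for each state); the class name `TabularSoftmaxPolicy` is hypothetical, and this is only one of many possible parameterizations.

```python
import math
import random

def softmax(logits):
    """Convert a list of preferences into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    normalizer = sum(exps)
    return [e / normalizer for e in exps]

class TabularSoftmaxPolicy:
    """Policy whose parameters theta[s][a] are preferences for action a in state s."""

    def __init__(self, num_states, num_actions):
        self.theta = [[0.0] * num_actions for _ in range(num_states)]

    def action_probs(self, state):
        """Probability of each action in the given state."""
        return softmax(self.theta[state])

    def act(self, state, rng=random):
        """Sample an action index for the given state."""
        probs = self.action_probs(state)
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]

policy = TabularSoftmaxPolicy(num_states=2, num_actions=3)
policy.theta[1] = [0.0, 5.0, 0.0]   # make action 1 strongly preferred in state 1
```

Changing the entries of `theta` changes which actions the policy favors, which is what learning adjusts.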
7. 7
Control problem
Description
Control of an object in a specific environment
RL can handle this problem at any level
• Perception, decision-making, and control
– End-to-end control
• Decision-making and control
– Decision and control
• Control only
Example problem
In the robot-arm domain, end-to-end control
problems have been studied with RL
In the autonomous vehicle domain, decision and
control problems have been studied with RL
Figures: control problem; example problems
8. 8
Multi-armed bandit
Description
Selection of an action from a specific set
RL can handle this problem over any horizon
• Finite-horizon problem
• Infinite-horizon problem
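Repeatedly selecting one action from a fixed set can be sketched with an epsilon-greedy strategy, a standard bandit baseline that the slides do not name. The arm reward probabilities, the function name `run_epsilon_greedy`, and the finite horizon of 5000 steps are all assumptions made for illustration.

```python
import random

def run_epsilon_greedy(arm_means, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit; return (pull counts, estimated value per arm)."""
    rng = random.Random(seed)
    num_arms = len(arm_means)
    counts = [0] * num_arms
    values = [0.0] * num_arms  # running average reward observed for each arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(num_arms)                         # explore: random arm
        else:
            arm = max(range(num_arms), key=lambda a: values[a])   # exploit: best estimate
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0    # sample a 0/1 reward
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]       # incremental mean update
    return counts, values

# Hypothetical arms that pay out with probability 0.2, 0.5, and 0.8.
counts, values = run_epsilon_greedy([0.2, 0.5, 0.8])
```

Over enough steps, most pulls should go to the arm with the highest payout probability, while the occasional random pull keeps the other estimates up to date.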
Example problem
In the board game domain, the RL agent selects
an empty cell at every time step
In the recommender system domain, the RL
agent suggests an item to the user at every
trigger time
In the computer science domain, the RL agent
assigns a job to a machine at every time step
Figures: multi-armed bandit problem; example problems
9. 9
Combinatorial optimization
Description
Multiple selections of actions from a
specific set
RL can handle this problem in one step
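One way to picture a one-step combinatorial decision is an agent that emits a score per item and has the whole solution decoded from those scores at once. The sketch below uses a routing toy: scores are decoded into a visiting order, and the reward is the negative route length. The decoding rule and the function names are assumptions for illustration, not the method of any particular system.

```python
import math

def one_step_route(scores):
    """Decode per-stop scores into a full visiting order (highest score first)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def tour_length(points, order):
    """Length of the closed route that visits `points` in the given order."""
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )

def reward(points, scores):
    """Single reward for the single composite action: shorter routes score higher."""
    return -tour_length(points, one_step_route(scores))

# Hypothetical stops at the corners of a unit square.
corners = [(0, 0), (0, 1), (1, 1), (1, 0)]
around_the_square = reward(corners, [4, 3, 2, 1])   # visits 0, 1, 2, 3
crossing_diagonals = reward(corners, [4, 2, 3, 1])  # visits 0, 2, 1, 3
```

The route around the square earns a higher reward than the one that crosses both diagonals, so a learner would be pushed toward scores that decode to the shorter route.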
Example problem
In the chip placement domain, the RL agent places
semiconductors on the empty wafer in just one step
In the routing problem domain, the RL agent calculates
the order of the driving route in just one step
In the math problem domain, the RL agent optimizes the
symbolic components or the order of operations
In the chemistry domain, the RL agent optimizes the
reaction process
Figures: combinatorial optimization problem; example problems
10. 10
Cooperative behavior learning
Description
Control of multiple objects in a specific
environment
RL can handle this problem in any setting
• Individual reward problem
• Team reward problem
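The difference between the two reward settings can be shown in a few lines. In this sketch, `contributions` is a hypothetical list holding what each agent achieved on its own in one step; the function names and the choice of summing for the team reward are assumptions made for illustration.

```python
def individual_rewards(contributions):
    """Individual reward setting: each agent is rewarded for its own contribution."""
    return list(contributions)

def team_rewards(contributions):
    """Team reward setting: every agent receives the same shared team total."""
    shared = sum(contributions)
    return [shared] * len(contributions)

step_contributions = [1.0, 0.0, 2.0]
per_agent = individual_rewards(step_contributions)   # [1.0, 0.0, 2.0]
per_team = team_rewards(step_contributions)          # [3.0, 3.0, 3.0]
```

With a team reward, an agent cannot tell from the reward alone how much it contributed, which is one reason the two settings are treated separately.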
Example problem
In the communication domain, the RL agent
distributes resources to achieve the team
goal
In the game domain, the commander RL agent
controls multiple units to achieve victory
Figures: cooperative behavior learning problem; example problems
11. 11
Competitive behavior learning
Description
Control of multiple objects in a specific
environment
RL can handle this problem in a zero-sum game
setting
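In a zero-sum game, the rewards of the two players cancel out for every joint action: whatever one player gains, the other loses. The sketch below checks this property on matching pennies, a textbook two-player game used here only as an illustration; the table layout and the function `is_zero_sum` are assumptions of this example.

```python
# Payoff table: (action of player 1, action of player 2) -> (reward 1, reward 2).
# Player 1 wins when the coins match; player 2 wins when they differ.
MATCHING_PENNIES = {
    ("heads", "heads"): (1, -1),
    ("heads", "tails"): (-1, 1),
    ("tails", "heads"): (-1, 1),
    ("tails", "tails"): (1, -1),
}

def is_zero_sum(payoffs):
    """True if the two rewards sum to zero for every joint action."""
    return all(r1 + r2 == 0 for r1, r2 in payoffs.values())

zero_sum = is_zero_sum(MATCHING_PENNIES)                        # True
not_zero_sum = is_zero_sum({("help", "help"): (3, 3)})          # False: both players gain
```

A table that fails this check, such as the second one where both players gain, describes a general-sum game instead.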
Example problem
In the game domain, the RL agent learns
competitive behavior in various games such as
chess, Go, and StarCraft II
Figures: competitive behavior learning problem; example problems
12. 12
Mixed behavior learning
Description
Control of multiple objects in a specific
environment
RL can handle this problem in a general-sum
game setting
• Cooperative behavior learning within the same group
• Competitive behavior learning between different
groups
Example problem
In the game domain, group battles have been
studied with RL
In the autonomous vehicle domain, the RL agent
controls multiple autonomous vehicles in
mixed autonomy
Figures: mixed behavior learning problem; example problems
13. 13
Learning from human experts
Description
Training the agent from demonstration
trajectories
RL can handle complex problems
with the help of human experts
• Problems that have complex rules, such as Go
• Problems that involve complex scenarios, such
as autonomous vehicle driving
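A simple way to learn from demonstration trajectories is behavior cloning: treat the expert's state-action pairs as supervised data and copy the expert's choice. The slides do not name a specific method, so this is a swapped-in illustration; the tabular majority-vote rule, the function name `behavior_cloning`, and the traffic-light demonstrations are all hypothetical.

```python
from collections import Counter, defaultdict

def behavior_cloning(demonstrations):
    """Fit a tabular policy that picks the expert's most frequent action per state.

    `demonstrations` is a list of trajectories, each a list of (state, action) pairs.
    """
    action_counts = defaultdict(Counter)
    for trajectory in demonstrations:
        for state, action in trajectory:
            action_counts[state][action] += 1
    return {
        state: counter.most_common(1)[0][0]
        for state, counter in action_counts.items()
    }

# Hypothetical expert demonstrations for a traffic-light task.
expert_demos = [
    [("red", "stop"), ("green", "go")],
    [("red", "stop"), ("green", "go"), ("yellow", "stop")],
    [("yellow", "go"), ("yellow", "stop")],
]
cloned_policy = behavior_cloning(expert_demos)
```

The cloned policy covers only the states that appear in the demonstrations, which is a known limitation of this approach.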
Example problem
In the autonomous vehicle domain, the RL agent
controls the autonomous vehicle in complex scenarios
In the finance domain, the RL agent decides whether to
buy or sell stocks in complex scenarios
In the game domain, an RL agent built on
a powerful neural network such as the
Transformer can handle multiple games (DeepMind
Gato)
Figures: learning from human experts problem; example problems
14. 14
Learning from human feedback (preference)
Description
Learning a reward model for the agent from human
feedback (positive/negative), and then learning the policy of
the agent using the learned reward model
RL can handle human-centered problems through
human feedback
• NLP problems that require human judgment
• Problems that involve complex scenarios, such as solving
the cube or autonomous vehicle driving
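The two stages described above (first fit a reward model to human feedback, then derive a policy from it) can be sketched with a lookup table. This is a deliberately simplified stand-in: real systems train a neural reward model and optimize the policy with an RL algorithm. The function names, the +1/-1 label encoding, and the feedback data are hypothetical.

```python
from collections import defaultdict

def fit_reward_model(feedback):
    """Stage 1: estimate a reward per (prompt, response) as the mean human label.

    `feedback` is an iterable of (prompt, response, label) with label +1 or -1.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for prompt, response, label in feedback:
        totals[(prompt, response)] += label
        counts[(prompt, response)] += 1
    return {key: totals[key] / counts[key] for key in totals}

def greedy_policy(reward_model, prompt, candidates):
    """Stage 2: pick the candidate response with the highest learned reward."""
    return max(candidates, key=lambda response: reward_model.get((prompt, response), 0.0))

# Hypothetical positive/negative feedback collected from human raters.
human_feedback = [
    ("greet", "hello", +1),
    ("greet", "hello", +1),
    ("greet", "go away", -1),
    ("greet", "hi", +1),
    ("greet", "hi", -1),
]
reward_model = fit_reward_model(human_feedback)
best_response = greedy_policy(reward_model, "greet", ["go away", "hi", "hello"])
```

Separating the stages means the human only has to label examples once; afterwards the policy can be improved against the learned reward model without further human input.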
Example problem
In the robotics domain, the RL agent can
be trained with human feedback to solve the
cube (OpenAI DAGGER)
In the NLP domain, the RL agent can be
trained with human feedback to reflect human
values or preferences (OpenAI ChatGPT)
• It learns to refrain from hate speech, to keep responses natural in context, and so on
Figures: learning from human feedback problem; example problems