An Introduction to
Reinforcement Learning
Jie-Han Chen
NetDB, National Cheng Kung University
3/27, 2018 @ National Cheng Kung University, Taiwan
Disclaimer
The content in this lecture is borrowed from:
1. Rich Sutton’s textbook
2. David Silver’s Reinforcement Learning class at UCL
3. Sergey Levine’s Deep Reinforcement Learning class at UC Berkeley
Syllabus
● Introduction to Reinforcement Learning
● Markov Decision Process
● Dynamic Programming
● Monte Carlo method
● Temporal Difference method
● Deep Reinforcement Learning
● Policy Gradient
● Hierarchical Reinforcement Learning and Multiagent Reinforcement Learning
● Active Research Issues
Resources
Textbooks:
● Reinforcement Learning: An Introduction, Sutton and Barto
● Algorithms for Reinforcement Learning, Szepesvari
Courses:
● CS 294 Deep Reinforcement Learning, Berkeley
● David Silver’s Reinforcement Learning course, UCL
● CMU 10703 Deep Reinforcement Learning and Control, CMU
● Shan-Hung Wu’s Deep Learning course in NTHU
All of these are reference materials for this lecture.
Outline
● Syllabus
● Introduction
● Elements of reinforcement learning and its objective
● History of RL
● Applications
● Challenges and active research fields in RL
● Research institutes and notable researchers
Machine Learning
From David Silver’s RL course
Introduction to Reinforcement Learning
Reinforcement learning is a learning framework distinct from supervised learning
and unsupervised learning.
It consists of a series of perceptions and interactions between an agent and its
environment.
From Sutton’s book
Agent and Environment
At each step t the agent:
● Receives scalar reward Rt
● Receives observation Ot
● Executes action At
The environment:
● Receives action At
● Emits observation Ot+1
● Emits scalar reward Rt+1
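This perception-action loop can be sketched in a few lines of Python. A minimal sketch, assuming the classic OpenAI Gym interface; the environment name and the random stand-in policy are illustrative:

```python
import gym  # assumes the OpenAI Gym package is available

env = gym.make("CartPole-v0")   # illustrative environment choice
observation = env.reset()       # initial observation

done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()                   # random policy as a stand-in for the agent
    observation, reward, done, info = env.step(action)   # environment emits O_{t+1} and R_{t+1}
    total_reward += reward

print("episode return:", total_reward)
```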
Introduction to Reinforcement Learning
Reinforcement learning is often used to solve sequential decision problems.
● Goal: select actions to maximize total future reward
● Actions may have long-term consequences
● Reward may be delayed
● It may be better to sacrifice immediate reward to gain more long-term reward
● E.g.:
○ A financial investment
○ A chess game
Supervised Learning & Unsupervised Learning
The input data are independent and identically distributed (i.i.d.).
The current output will not affect the next input.
Reinforcement Learning
The agent’s actions do affect the data
received in the future.
Figure from Wikipedia, made by waldoalvarez
Introduction to Reinforcement Learning
● In reinforcement learning, the agent learns from trial and error.
● Better experience allows the agent to learn a better policy.
● What kind of experience is better?
The image is from:
http://www.homemeeting.us/franktmc/maze_2.jpg
Elements of reinforcement learning
● Policy
● Reward signal
● Value function
● Model of environment (optional)
Elements of reinforcement learning - policy
Policy
● Defines the learning agent’s way of behaving at a given time; it could be a
simple function, a lookup table, or a search process
● Often denoted by π
● Can be deterministic or stochastic
Elements of reinforcement learning - policy
Suppose you are Russell Westbrook, currently
being defended by James Harden. In
this situation, you have 3 choices:
● Cut
● Shoot
● Pass
Stochastic policy
[Figure: probability assigned to each action; the probability mass is spread over several actions]
Deterministic policy
[Figure: probability assigned to each action; all probability mass is on a single action]
Policies - Action space
In reinforcement learning, we can categorize the problem by the action space into
2 types.
● Discrete action space
● Continuous action space
In the previous example, the actions lie in a discrete space, but there are
many examples of continuous control, e.g., a robotic arm. The stochastic policy of a
continuous control problem looks like a probability density function.
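A common way to realize such a density, shown here only as an illustrative sketch, is a Gaussian policy: the policy defines a mean and standard deviation for the action and samples from that density.

```python
import random

def gaussian_policy(state):
    """A toy stochastic policy for a 1-D continuous action (e.g., a joint torque).

    In practice the mean and std would be learned functions of the state;
    here they are fixed illustrative values.
    """
    mean, std = 0.5, 0.1
    return random.gauss(mean, std)   # sample an action from the density

print("continuous action:", gaussian_policy(state=None))
```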
Elements of reinforcement learning - reward
Reward: r / Rt
● Defines the goal in a reinforcement learning problem
● Indicates how well the agent is doing at step t
● Perceived immediately from the environment
Elements of reinforcement learning - reward
[Figure: an example asking which reward values to assign: +2? 0 or -0.2?]
Elements of reinforcement learning - reward
In chess or Go, the reward is defined
by its outcome.
● Win: +1
● Draw: 0
● Lose: -1
At most steps we don’t receive any
reward (value = 0). This is a kind of
sparse-reward problem.
Elements of reinforcement learning - reward
If we want to reach the goal in fewer
steps, we often define the reward to be
-1 for every step taken.
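These two reward schemes can be written as simple functions; a minimal sketch, with hypothetical outcome/state encodings:

```python
def sparse_reward(outcome):
    """Game-outcome reward, as in chess or Go: zero everywhere except at the end."""
    return {"win": +1, "draw": 0, "lose": -1}.get(outcome, 0)

def step_penalty_reward(reached_goal):
    """Shortest-path style reward: -1 per step, so fewer steps yields a larger return."""
    return 0 if reached_goal else -1
```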
Elements of reinforcement learning - value function
Value function
● Indicates which decisions are good in the long run.
● There are two forms:
○ state-value function
○ action-value function
● Unlike the reward, the value function is an estimated value.
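For reference, the two forms are the expected discounted returns under a policy π (the standard definitions from Sutton and Barto):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right]
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]
```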
Elements of reinforcement learning - value function
The score has come to 99 vs. 98 (ours) with just
5 seconds left in the game.
Now, if you have to throw the ball in from midfield,
which one would you pass the ball to?
1. 櫻木花道 (Sakuragi Hanamichi)
2. 三井壽 (Mitsui Hisashi)
Elements of reinforcement learning - model
Model of the environment (optional)
● Something that mimics the behavior of the environment.
● Allows inferences to be made about how the environment will behave
(planning).
● Methods for solving reinforcement learning problems that use models for
planning are called model-based methods; methods that do not are
model-free methods.
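A minimal sketch of how a learned model supports planning; `model.predict` and `value_fn` are hypothetical stand-ins for a learned dynamics model and a learned state-value estimate:

```python
def plan_one_step(model, state, actions, value_fn):
    """Pick the action whose predicted outcome looks best under the model.

    One-step lookahead: query the (hypothetical) model for each action's
    predicted next state and reward, then score it with the value estimate.
    """
    best_action, best_score = None, float("-inf")
    for a in actions:
        next_state, reward = model.predict(state, a)  # model mimics the environment
        score = reward + value_fn(next_state)         # immediate reward + long-run value
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```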
Elements of reinforcement learning - model
[Figure: the agent’s interaction with the environment is used to learn a model,
which is then used for inferences (planning)]
The image is from David Silver’s RL course
Elements of reinforcement learning
● Policy
● Reward signal
● Value function
● Model of environment (optional)
The objective of reinforcement learning
Reinforcement learning is a framework
for goal-directed learning.
The objective of reinforcement learning
is to maximize the cumulative reward in
each task.
The image is from:
https://www.wikijob.co.uk/content/interview-advice/competencies/decision-making
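Formally, the quantity being maximized is the expected return, the discounted sum of future rewards (standard notation from Sutton’s book):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},
\qquad 0 \le \gamma \le 1
```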
History of Reinforcement Learning
Reinforcement learning is inspired by two domains:
● Optimal control
● Biological learning systems: animal learning
Optimal control
It is a mathematical optimization method for deriving control policies,
especially under certain constraints.
The method is largely due to the work of Lev Pontryagin and
Richard Bellman in the 1950s.
Richard Bellman
Richard Bellman was an applied
mathematician, who introduced dynamic
programming in 1953.
Work:
● Bellman Equation
● Curse of dimensionality
● Bellman-Ford algorithm
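For reference, the Bellman equation that bears his name expresses a state’s value recursively in terms of its successor states; in its standard expectation form for a policy π:

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a) \left[ r + \gamma V^{\pi}(s') \right]
```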
Animal Learning
● Teach dog - positive reward
Animal Learning
● Teach dog - penalty (negative reward)
Some questions about RL
● Why do we need to learn reinforcement learning?
● What has made reinforcement learning spring up like mushrooms?
Backgammon (IBM, 1992)
Temporal difference learning and TD-Gammon, by
Gerald Tesauro, 1992
Backgammon is 雙陸棋 in Chinese.
Source: Wikipedia
Autonomous Helicopter (Stanford, 2000)
Helicopter aerobatics has been studied since 2000 by Andrew Ng and
Pieter Abbeel at Stanford.
You can see more details at: http://heli.stanford.edu/
Deep reinforcement learning in Atari game (2013)
Deep Q-Network (DQN): proposed by V. Mnih et al. It was the first end-to-end
reinforcement learning model to combine deep learning with raw inputs.
Deep Reinforcement Learning for Robotic Manipulation
AlphaGo (DeepMind, 2016)
AlphaGo: David Silver, Aja Huang, et al. used Monte Carlo tree search (MCTS) and
deep reinforcement learning (policy gradient) to master the game of Go.
AlphaGo Zero (DeepMind, 2017)
AlphaGo Zero: David Silver et al. used MCTS and policy iteration with a
two-headed ResNet architecture to learn from scratch, without human knowledge.
Dota2 (OpenAI, 2017)
● Beats the world’s top professionals at 1v1 matches
● The bot learned from scratch by self-play
Alibaba (StarCraft 1, multiagent)
Deep RL for Dialogue Generation (Li et al., 2016)
● RL agent generates more interactive responses
● RL agent tends to end a sentence with a question and hand the conversation
over to the user
● Next step: explore intrinsic rewards, large-scale training
From the slides at http://opendialogue.miulab.tw
The challenges of reinforcement learning
● Sparse reward issue
● Reward credit assignment
● Large space for exploration (trial-and-error)
● Imperfect information, partial observation
Active research domains
● Multiagent reinforcement learning
● Hierarchical reinforcement learning
● Inverse reinforcement learning
● Multi-task and transfer learning in reinforcement learning
● Meta learning
● One-shot reinforcement learning
● Deep reinforcement learning in dialogue generation
Research institutes and notable researchers
The research scientists in RL you must know!
● Richard S. Sutton
● David Silver
● Pieter Abbeel
● Sergey Levine
Richard S. Sutton
● The founding father of reinforcement
learning
● Professor of Computer Science at the University
of Alberta
● Temporal difference learning
● Dyna architecture
David Silver
● Research scientist at DeepMind
● Lead researcher on the AlphaGo and AlphaGo
Zero teams
● Supervised by Sutton during his Ph.D.
● Previously a professor at University College London
Pieter Abbeel
● Professor at UC Berkeley
● Director of the UC Berkeley Robot Learning Lab
● Research scientist and advisor at OpenAI
Sergey Levine
● Assistant Professor at UC Berkeley
● Research scientist at Google Brain
● Autonomous robots
Questions?