The document presents an agenda for a talk on reinforcement learning and creating a bot to play FlappyBird. The agenda includes introducing reinforcement learning concepts like Markov decision processes, value functions, and deep Q-learning. It also demonstrates using OpenAI Gym to build a bot that learns to play FlappyBird through trial and error without being explicitly programmed.
5. No supervisor, only the reward signal.
Feedback is delayed, not instantaneous.
Sequential, non-i.i.d. data: time really matters.
Agent’s actions affect the subsequent data it receives.
Difficulties of RL
7. History: Ht = O1, R1, A1, O2, R2, A2, …, At−1, Ot, Rt
State is the information used to determine what happens next
St = f(Ht)
Agent state vs Environment state (S^a_t vs S^e_t)
Fully Observable and Partially Observable environment.
State
8. Policy
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[At = a|St = s]
Value function
vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + … | St = s]
Model
P^a_ss' = P[St+1 = s' | St = s, At = a]
R^a_s = E[Rt+1 | St = s, At = a]
Major components of an agent
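As a concrete sketch of the policy component: a deterministic policy is just a state→action mapping, while a stochastic policy stores one distribution π(a|s) per state. The states, actions, and probabilities below are made up for illustration.

```python
import numpy as np

# Hypothetical toy problem: states {0, 1}, actions {0, 1, 2}.

def deterministic_policy(s):
    # a = π(s): each state maps to exactly one action
    return {0: 2, 1: 0}[s]

# Stochastic policy: π(a|s) = P[At = a | St = s], one distribution per state.
pi = {
    0: np.array([0.1, 0.2, 0.7]),
    1: np.array([0.5, 0.5, 0.0]),
}

rng = np.random.default_rng(0)

def sample_action(s):
    # draw an action according to π(·|s)
    return rng.choice(len(pi[s]), p=pi[s])

print(deterministic_policy(0))  # 2
```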
9. Value based
Value function
No policy (Implicit)
Policy based
No value function
Policy
Actor Critic
Value function
Policy
Categorizing RL agents
10. Model free
Value function and/or policy
No model
Model based
Value function and/or policy
Model
Categorizing RL agents
11. Exploration finds more information about the environment
Exploitation exploits known information to maximize reward
Exploration vs Exploitation
import numpy as np

# Epsilon-greedy: explore with probability eps, otherwise exploit.
if np.random.uniform() < eps:
    action = random_action()      # explore: try a random action
else:
    action = get_best_action()    # exploit: current best-known action
12. Markov state contains all useful information from the history.
P[St+1 | St] = P[St+1 | S1,…, St]
Some examples:
The environment state S^e_t is Markov.
The history Ht is Markov.
Markov state (Information state)
13. A Markov Decision Process is a tuple (S, A, P, R, γ).
S: a finite set of states.
A: a finite set of actions
P: a state transition probability matrix
P^a_ss' = P[St+1 = s' | St = s, At = a]
R: reward function
R^a_s = E[Rt+1 | St = s, At = a]
γ: discount factor, γ ∈ [0, 1]
Markov Decision Process (MDP)
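For a finite MDP, all five elements of the tuple fit in plain arrays. A minimal sketch — the states, actions, and numbers below are invented for illustration:

```python
import numpy as np

# A tiny hypothetical MDP (S, A, P, R, γ) with 3 states and 2 actions.
n_states, n_actions = 3, 2
gamma = 0.9

# P[a, s, s'] = probability of moving s -> s' under action a
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.8, 0.2, 0.0],
        [0.0, 0.9, 0.1],
        [0.0, 0.0, 1.0]]
P[1] = [[0.1, 0.9, 0.0],
        [0.0, 0.1, 0.9],
        [0.0, 0.0, 1.0]]

# R[s, a] = E[Rt+1 | St = s, At = a]
R = np.array([[0.0, 1.0],
              [0.5, 2.0],
              [0.0, 0.0]])

# Each row of every transition matrix must sum to 1.
assert np.allclose(P.sum(axis=2), 1.0)
```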
15. The state-value function vπ(s) is the expected return
starting from state s, and then following policy π.
The action-value function qπ(s, a) is the expected return
starting from state s, taking action a, and then following policy
π.
vπ(s) = Eπ [Gt | St = s]
qπ(s, a) = Eπ [Gt | St = s, At = a]
Gt = Rt+1 + γRt+2 + γ²Rt+3 + …
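For a finite reward sequence, the return Gt can be computed by folding from the back; a small sketch:

```python
def discounted_return(rewards, gamma):
    """Gt = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + …, for a finite episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g   # fold backwards: g accumulates the discounted tail
    return g

# Example: rewards [1, 1, 1] with γ = 0.5 → 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```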
Value function of MDP
Real world reinforcement learning: learn from experience to maximize the rewards.
The dog watches the trainer's actions, hears her command, and reacts based on that information.
If the reaction is good, the dog receives a reward (a lure, a compliment…). If the reaction is not good, the dog receives no reward.
The dog learns from its experience to find a way to get as many rewards as possible.
AlphaGo: defeated Ke Jie (other game playing: Atari, chess…)
Waymo: Self driving car (Google)
DeepMind AI Reduces Google Data Centre Cooling Bill by 40% (https://goo.gl/JbcH5n)
Robotics
SpaceX reuses rocket.
Financial (Investment)
How does this differ from supervised and unsupervised learning?
We usually don't receive the reward immediately. When playing chess, we win or lose because of moves made earlier in the game. In the self-driving car problem, right before an accident the driver often hits the brake — but the braking is not what caused the crash.
Observation -> action -> reward -> new observation -> new action -> new reward.
The agent's actions can change the environment and affect future observations.
At step t: do action At, see new observation Ot and receive reward Rt
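The loop above can be sketched with a Gym-style reset/step interface. The toy environment and the naive agent here are pure illustrations, not the FlappyBird setup:

```python
import random

class CoinFlipEnv:
    """Toy environment: the observation is a coin flip; guessing it pays reward 1."""
    def reset(self):
        self.coin = random.randint(0, 1)
        return self.coin                    # initial observation
    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        self.coin = random.randint(0, 1)    # environment moves on
        return self.coin, reward            # new observation, reward

env = CoinFlipEnv()
obs = env.reset()
total = 0.0
for t in range(100):
    action = obs                       # naive agent: repeat the last observation
    obs, reward = env.step(action)     # act At, see Ot+1, receive Rt+1
    total += reward
```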
History is a series of observations, rewards and actions from the beginning to current time.
State is a function of history.
The environment state is the environment's private representation, usually not visible to the agent. Even when it is visible, it may contain irrelevant information.
In a fully observable environment, the agent directly observes the environment state (S^a_t = S^e_t).
In a partially observable environment, the agent observes the environment only indirectly (S^a_t ≠ S^e_t).
Policy is the agent’s behavior, it maps from state to action.
A value function is a prediction of future reward, used to evaluate the goodness/badness of states and to choose between actions.
A model predicts what the environment will do next
P predicts the next state.
R predicts the next immediate reward (only its expected value, not the actual Rt+1).
If γ = 0, the agent cares only about the immediate reward; if γ = 1, future rewards are not discounted at all.
Categorizing : value based, policy based, actor critic
Categorizing : model free, model based
Reinforcement learning is like trial-and-error learning
The agent discovers a good policy from its experiences of the environment without losing too much reward along the way.
Reduce epsilon during training time.
When at test mode, just choose the best action.
Epsilon is a small number (e.g. decayed from 1 down to 0.1).
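One common way to reduce epsilon during training is a linear anneal over the first part of training; a sketch (the schedule values below are assumptions, not from the talk):

```python
def epsilon_at(step, eps_start=1.0, eps_end=0.1, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start down to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)          # progress through the schedule
    return eps_start + frac * (eps_end - eps_start)

print(epsilon_at(0))       # 1.0 at the start of training
print(epsilon_at(20_000))  # stays at the floor (~0.1) after decay_steps
```

At test time, set epsilon to 0 and always take the best action.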
When the state is known, the history can be thrown away.
Can convert or create the Markov state by adding more information.
Some more examples: a chess board (plus knowing which player moves next) is Markov; when driving a car, we just need the current conditions (position, speed…) and don't need to care about the history.
Why do we need the gamma discount factor?
The discount γ is the present value of future rewards
Avoids infinite returns in cyclic Markov processes
Uncertainty about the future
Like money in the bank, a reward today is worth more than the same reward tomorrow.
Animal/human behavior shows a preference for immediate reward.
The example is from David Silver’s course.
Circles and squares are states (square: terminal state)
Some actions: Facebook, Quit, Study…
From the 3rd state, if we choose the action Pub, it may end in different states.
From state s, we can do many action, the probability of each action is π(a|s)
After that, we receive a reward, then move to another state s' with probability P^a_ss'.
From state s, we choose action a, receive reward R^a_s, then can move to many new states.
After that, we can take further actions according to π(a'|s').
The optimal state-value function v∗(s) is the maximum value function over all policies
The optimal action-value function q∗(s, a) is the maximum action-value function over all policies
An MDP is “solved” when we know the optimal value function. The optimal value function specifies the best possible performance in the MDP.
If we know q∗(s, a), we immediately have the optimal policy: act greedily with respect to q∗.
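Acting greedily on a known q∗ table is a one-liner: π∗(s) = argmax over a of q∗(s, a). The q-values below are made up for illustration:

```python
import numpy as np

# Hypothetical q*(s, a) table: 3 states (rows) x 2 actions (columns).
q_star = np.array([[1.0, 2.0],
                   [0.5, 0.1],
                   [3.0, 3.5]])

# Greedy policy: for each state, pick the action with the highest q-value.
optimal_policy = q_star.argmax(axis=1)
print(optimal_policy)  # [1 0 1]
```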
Input: the state.
Output: a vector of Q-values, one per action (size: nb_actions).
Dueling DQN: the network splits into two streams. The first estimates the value function V(s), which says simply how good it is to be in a given state. The second estimates the advantage function A(s, a), which says how much better taking a certain action is compared to the others. Q is then the combination of V and A.
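The combination step can be sketched in plain NumPy; subtracting the mean advantage is the standard trick that keeps the V/A decomposition identifiable (otherwise a constant could shift freely between the two streams). The numbers are illustrative:

```python
import numpy as np

def dueling_q(v, advantages):
    """Combine a state value V(s) and per-action advantages A(s, a) into Q(s, a)."""
    a = np.asarray(advantages, dtype=float)
    # Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
    return v + (a - a.mean())

q = dueling_q(v=2.0, advantages=[1.0, -1.0, 0.0])
print(q)  # [3. 1. 2.]
```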