July 2, 2017
Create a Bot to Play FlappyBird
Introduction to Reinforcement Learning
Phu Nguyen
• What is Reinforcement Learning?
• Markov Decision Process
• Introduction to OpenAI Gym
• Demo: Bot to play FlappyBird
Agenda
What is RL?
RL examples
• No supervisor, only a reward signal.
• Feedback is delayed, not instantaneous.
• Time really matters (sequential, non-i.i.d. data).
• The agent’s actions affect the subsequent data it receives.
Difficulties of RL
Agent and Environment
[Diagram: the agent-environment loop. At each step t, the agent receives observation Ot and reward Rt, and takes action At.]
• History: Ht = O1, R1, A1, O2, R2, A2, …, At-1, Ot, Rt
• State is the information used to determine what happens next:
St = f(Ht)
• Agent state vs environment state (Sᵃt vs Sᵉt)
• Fully observable and partially observable environments.
State
• Policy
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[At = a | St = s]
• Value function
vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + … | St = s]
• Model
Pᵃss’ = P[St+1 = s’ | St = s, At = a]
Rᵃs = E[Rt+1 | St = s, At = a]
Major components of an agent
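To make these three components concrete, here is a minimal Python sketch (not from the talk; the two-state problem and all names are invented for illustration) representing a stochastic policy, a value function, and a model as plain tables:

import numpy as np

# Hypothetical two-state, two-action problem, for illustration only.
states  = ["s0", "s1"]
actions = ["stay", "move"]

# Policy pi(a|s): a table of action probabilities (a stochastic policy).
policy = {"s0": {"stay": 0.5, "move": 0.5},
          "s1": {"stay": 0.1, "move": 0.9}}

# Value function v_pi(s): expected discounted return from each state.
value = {"s0": 0.0, "s1": 0.0}

# Model: transition probabilities P[s'|s,a] and expected rewards R[s,a].
P = {("s0", "move"): {"s1": 1.0}, ("s0", "stay"): {"s0": 1.0},
     ("s1", "move"): {"s0": 1.0}, ("s1", "stay"): {"s1": 1.0}}
R = {("s0", "move"): 1.0, ("s0", "stay"): 0.0,
     ("s1", "move"): 0.0, ("s1", "stay"): 0.0}

def sample_action(s):
    # Draw an action from the stochastic policy pi(.|s).
    probs = policy[s]
    return np.random.choice(list(probs), p=list(probs.values()))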
• Value based: value function, no policy (implicit)
• Policy based: policy, no value function
• Actor critic: value function and policy
Categorizing RL agents
• Model free: value function and/or policy, no model
• Model based: value function and/or policy, and a model
Categorizing RL agents
• Exploration finds more information about the environment.
• Exploitation exploits known information to maximize reward.
Exploration vs Exploitation
if np.random.uniform() < eps:
    action = random_action()    # explore with probability eps
else:
    action = get_best_action()  # otherwise exploit (greedy action)
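The speaker notes suggest reducing epsilon over the course of training and acting greedily at test time. A common schedule (an assumption here, not something the slides specify) is linear annealing:

# Linear annealing of epsilon from 1.0 down to 0.1 over the first
# 100,000 steps; all constants are illustrative, not from the talk.
EPS_START, EPS_END, ANNEAL_STEPS = 1.0, 0.1, 100000

def epsilon(step):
    frac = min(step / ANNEAL_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)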
• A Markov state contains all useful information from the history:
P[St+1 | St] = P[St+1 | S1, …, St]
• Some examples:
The environment state Sᵉt is Markov.
The history Ht is Markov.
Markov state (Information state)
• A Markov Decision Process is a tuple (S, A, P, R, γ).
• S: a finite set of states.
• A: a finite set of actions.
• P: a state transition probability matrix,
Pᵃss’ = P[St+1 = s’ | St = s, At = a]
• R: a reward function,
Rᵃs = E[Rt+1 | St = s, At = a]
• γ: a discount factor, γ ∈ [0, 1].
Markov Decision Process (MDP)
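As a rough sketch (sizes and numbers invented for illustration), the tuple (S, A, P, R, γ) can be stored directly as arrays; this is the form that tabular solution methods later in the deck operate on:

import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9  # discount factor in [0, 1]

# P[a, s, s2] = P[St+1 = s2 | St = s, At = a]; each row sums to 1.
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
P[1] = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

# R[a, s] = E[Rt+1 | St = s, At = a].
R = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])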
Example: Student MDP
• The state-value function vπ(s) is the expected return starting from state s and then following policy π.
• The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π.
• vπ(s) = Eπ[Gt | St = s]
• qπ(s, a) = Eπ[Gt | St = s, At = a]
• Gt = Rt+1 + γRt+2 + γ²Rt+3 + …
Value functions of an MDP
Bellman Expectation Equation for vπ
Bellman Expectation Equation for qπ
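These two slides show the equations as images in the original deck. For reference, the standard forms from David Silver's course (which this talk follows), in the notation used above, are:

vπ(s) = Σa π(a|s) (Rᵃs + γ Σs’ Pᵃss’ vπ(s’))
qπ(s, a) = Rᵃs + γ Σs’ Pᵃss’ Σa’ π(a’|s’) qπ(s’, a’)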
State-Value Function for Student MDP
7.4 = 0.5 * (1 + 0.4*7.4 + 0.4*2.7 + 0.2*(-1.3)) + 0.5 * 10
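A quick check of the slide's arithmetic (the backup shown implies γ = 1 in the Student MDP):

v = 0.5 * (1 + 0.4 * 7.4 + 0.4 * 2.7 + 0.2 * (-1.3)) + 0.5 * 10
print(round(v, 2))  # 7.39, which matches the 7.4 on the slide up to rounding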
• State-value function: v∗(s) = maxπ vπ(s)
• Action-value function: q∗(s, a) = maxπ qπ(s, a)
• Optimal policy:
π∗(a|s) = 1 if a = argmaxₐ q∗(s, a), 0 otherwise
Optimal value function and policy
Bellman equation for optimal value function
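This slide's equations are images in the original deck; in the notation used above, the Bellman optimality equations are:

v∗(s) = maxa (Rᵃs + γ Σs’ Pᵃss’ v∗(s’))
q∗(s, a) = Rᵃs + γ Σs’ Pᵃss’ maxa’ q∗(s’, a’)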
Optimal policy for Student MDP
• Value Iteration
• Policy Iteration
• Q-learning
• Sarsa
• …
Solving the Bellman Optimality Equation
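As one concrete example of these methods, here is a minimal sketch of tabular Q-learning. The env object is assumed to follow the classic OpenAI Gym API mentioned in the agenda (reset() returns a state index, step(a) returns (state, reward, done, info)); alpha, gamma, and the episode count are illustrative choices, not values from the talk.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=5000,
               alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (exploration vs exploitation).
            if np.random.uniform() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done, _ = env.step(a)
            # Move Q(s, a) toward the TD target r + gamma * max_a' Q(s', a').
            target = r + gamma * np.max(Q[s2]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q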
Deep Q-Learning
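The Deep Q-Learning slides are images in the original deck; the speaker notes describe the network as taking the state as input and outputting a vector of Q-values of size nb_actions. A minimal Keras sketch of that shape (layer sizes and the state dimension are assumptions, not the talk's actual implementation):

from tensorflow import keras

STATE_DIM, NB_ACTIONS = 8, 2   # hypothetical dimensions

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(NB_ACTIONS, activation="linear"),  # one Q-value per action
])
model.compile(optimizer="adam", loss="mse")  # regress toward TD targets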
Demo FlappyBird & Discussion
• https://www.coursera.org/learn/machine-learning
• https://www.coursera.org/learn/neural-networks
• NLP: https://web.stanford.edu/class/cs224n/
• CNN: http://cs231n.stanford.edu/
• RL: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
• http://www.deeplearningbook.org/
• Reinforcement Learning: An Introduction (Richard S. Sutton and Andrew G. Barto)
Courses and books
Editor's Notes

1. Real-world reinforcement learning: learn from experience to maximize rewards. A dog watches the trainer's actions, hears her command, and reacts based on that information. If the reaction is good, the dog receives a reward (a treat, praise…). If not, it receives no reward. The dog learns from experience to find the behavior that earns as many rewards as possible.
2. AlphaGo: defeated Ke Jie (Kha Khiết); other game playing: Atari, chess… Waymo: self-driving cars (Google). DeepMind AI reduced Google's data centre cooling bill by 40% (https://goo.gl/JbcH5n). Robotics. SpaceX reuses rockets. Finance (investment).
3. Compare with supervised and unsupervised learning. We usually don't receive the reward immediately: in chess, we win or lose because of moves made in the past; in self-driving, the driver often hits the brake right before the accident. Observation → action → reward → new observation → new action → new reward. The agent's actions can change the environment and affect future observations.
4. At step t: take action At, see new observation Ot, and receive reward Rt.
5. The history is the series of observations, rewards, and actions from the beginning to the current time. State is a function of the history. The environment state is the environment's private representation, usually not visible to the agent; even if visible, it may contain irrelevant information. In a fully observable environment, the agent directly observes the environment state (Sᵃ = Sᵉ). In a partially observable environment, the agent observes it only indirectly (Sᵃ ≠ Sᵉ).
6. The policy is the agent's behavior; it maps from state to action. The value function is a prediction of future reward, used to evaluate the goodness/badness of states and so to choose actions. A model predicts what the environment will do next: P predicts the next state, R predicts the next immediate reward (the expected value of Rt+1, not the realized value). If γ = 0, the agent cares only about the immediate reward; if γ = 1, future rewards are not discounted.
7. Categorizing: value based, policy based, actor critic.
8. Categorizing: model free, model based.
9. Reinforcement learning is like trial-and-error learning: the agent must discover a good policy from its experience of the environment without losing too much reward along the way. Reduce epsilon during training; at test time, just choose the best action. Epsilon is a small number (e.g., annealed from 1 to 0.1).
10. When the state is known, the history can be thrown away. A Markov state can often be created by adding more information. More examples: a chess board (plus knowing which player moves next) is Markov; when driving a car, you just need the current conditions (position, speed, …), not the history.
11. Why do we need the discount factor γ? The discount expresses the present value of future rewards; it avoids infinite returns in cyclic Markov processes and reflects uncertainty about the future. As with money in the bank, a reward today is better than one tomorrow. Animal/human behavior also shows a preference for immediate reward.
12. The example is from David Silver's course. Circles and squares are states (square: terminal state). Some actions: Facebook, Quit, Study… From the third state, choosing the action Pub may end in different states.
13. From state s we can take many actions, each with probability π(a|s). After that we receive a reward and may move to another state s’ with probability Pᵃss’.
14. From state s, we choose action a and receive reward Rᵃs; we may then move to many new states, from which we can take further actions according to π(a’|s’).
15. The optimal state-value function v∗(s) is the maximum value function over all policies; the optimal action-value function q∗(s, a) is the maximum action-value function over all policies. An MDP is “solved” when we know the optimal value function: it specifies the best possible performance in the MDP, and if we know q∗(s, a) we immediately have the optimal policy.
16. By acting greedily with respect to q∗, we find the optimal policy.
17. Input: state. Output: a vector of Q-values (size: nb_actions). Dueling DQN: the first stream is the value function V(s), which says simply how good it is to be in a given state; the second is the advantage function A(s, a), which tells how much better taking a certain action is compared to the others. Q can then be thought of as the combination of V and A.
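The standard dueling aggregation (from Wang et al., 2016, which this note appears to describe) subtracts the mean advantage so the V/A decomposition is identifiable; a one-line numpy sketch:

import numpy as np

def dueling_q(v, a):
    # v: scalar V(s); a: vector of advantages A(s, a) over actions.
    return v + (a - np.mean(a))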