Introduction to Reinforcement Learning
Lili Wu, Laber Lab
October 23, 2018
Outline
Basics and examples
Setup and notation
Problems in RL
How do we get optimal policy given data?
How do we balance exploration and exploitation?
RL in Laber Labs
Basic idea
Reinforcement learning (RL): an agent interacts with an environment, which provides rewards
Goal: learn how to take actions so as to maximize cumulative reward
History
Figure 1: Puzzle Box (trial-and-error learning). Figure 2: Thorndike, 1911.
Humans and animals learn from reward and punishment
In reinforcement learning, we try to get computers to learn
complicated skills in a similar way
Framework
Figure 3: Reinforcement learning
RL in the news
Advances in computing power and algorithms in recent years
have led to great interest in using RL for artificial intelligence
RL has now been used to achieve superhuman performance
for a number of difficult games
Example: Atari
Figure 4: Deep Q-Network playing Breakout (Mnih et al. 2015).
States: pixels on the screen
Actions: move the paddle
Rewards: points
Example: AlphaZero (Silver et al. 2017)
Figure 5: The game of Go.
States: positions of stones
Actions: stone placement
Rewards: win/lose
Setup: MDPs
We formalize the reinforcement learning problem using a Markov
decision process (MDP) (S, A, T, r, γ):
S is the set of states the environment can be in;
A is the set of actions available to the decision-maker;
T : S × A × S → R+ is a transition function which gives the
probability distribution of the next state given the current
state and action;
r : S → R is the reward function;
γ is a discount factor, 0 ≤ γ < 1.
Data: at each time t we observe the current state, action, reward, and next state
(St, At, Rt, St+1).
Setup: Policies
Policies tell us which action to take in each state
π : S → A
Goal: choose a policy to maximize expected cumulative
discounted reward
Eπ [ Σ_{t=0}^∞ γ^t Rt ]
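The discounted sum above can be computed directly; a minimal Python sketch, where the reward sequence and discount factor are made up for illustration:

```python
# Discounted cumulative reward: sum over t of gamma^t * R_t.
def discounted_return(rewards, gamma):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

rewards = [1.0, 0.0, 2.0]                # R_0, R_1, R_2 (illustrative)
print(discounted_return(rewards, 0.5))   # 1 + 0 + 0.25*2 = 1.5
```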
Setup: Value functions
Value functions tell us the long-term rewards we can expect under
a given policy, starting from a given state and/or action.
“V-function” measures expected cumulative reward from a
given state:
V^π(s) = Eπ [ Σ_{t=0}^∞ γ^t Rt | S0 = s ]
“Q-function” measures expected cumulative reward from a
given state and action:
Q^π(s, a) = Eπ [ Σ_{t=0}^∞ γ^t Rt | S0 = s, A0 = a ]
          = Σ_{s′ ∈ S} [ r(s′) + γ V^π(s′) ] T(s′ | s, a)
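The identity relating Q^π and V^π can be checked numerically. The sketch below uses a made-up two-state MDP (transition probabilities, rewards, and the fixed policy are all illustrative): it runs policy evaluation to approximate V^π, then recovers Q^π from the sum over next states.

```python
# T[s][a][s2] = probability of next state s2; r[s2] = reward on entering s2.
# All numbers are illustrative.
T = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
r = [0.0, 1.0]
gamma = 0.9
policy = {0: 1, 1: 1}          # fixed policy: always take action 1

# Policy evaluation: iterate V(s) = sum_{s2} [r(s2) + gamma*V(s2)] T(s2|s,pi(s))
V = [0.0, 0.0]
for _ in range(1000):
    V = [sum(T[s][policy[s]][s2] * (r[s2] + gamma * V[s2]) for s2 in (0, 1))
         for s in (0, 1)]

def Q(s, a):
    return sum(T[s][a][s2] * (r[s2] + gamma * V[s2]) for s2 in (0, 1))

# Under the policy's own action, Q^pi(s, pi(s)) equals V^pi(s).
print(V[0], Q(0, policy[0]))
```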
Problem 1: Estimating optimal policy
Two ways of getting at the optimal policy π∗:
Try to improve π directly
Try to estimate Q^{π∗}
Example: Q-learning
Q_new(St, At) ← (1 − α) Q(St, At) + α [ Rt + γ max_a Q(St+1, a) ],
where α is the learning rate, 0 ≤ α ≤ 1.
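The update rule above, written in tabular form; the single transition fed to it is made up for illustration:

```python
from collections import defaultdict

# One Q-learning step: blend the old estimate with the bootstrapped target
# R_t + gamma * max_a Q(S_{t+1}, a), using learning rate alpha.
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]

Q = defaultdict(float)       # all Q-values start at 0
actions = [0, 1]
# After observing (s=0, a=1, r=1, s'=0), Q(0, 1) moves toward the target.
q_update(Q, 0, 1, 1.0, 0, actions)
print(Q[(0, 1)])             # 0.5 * 0 + 0.5 * (1.0 + 0.9 * 0) = 0.5
```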
Problem 2: Exploration-exploitation tradeoff
Tradeoff between gaining information (exploration) and
following current estimate of optimal policy (exploitation)
Restaurant example
Exploitation: Go to your favorite restaurant
Exploration: Try a new place
Need to balance both to maximize cumulative deliciousness
Different strategies
Occasionally do something completely random
Act based on optimistic estimates of each action’s value
Sample action according to its posterior probability of being
optimal
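The first strategy ("occasionally do something completely random") is commonly called ε-greedy; a minimal sketch, with illustrative Q-values:

```python
import random

# epsilon-greedy: with probability epsilon pick a random action (explore),
# otherwise pick the action with the highest current Q-value (exploit).
def epsilon_greedy(q_values, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

q = [0.1, 0.7, 0.3]                     # illustrative Q-values
action = epsilon_greedy(q, epsilon=0.1)
```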
A small task solved with RL: CartPole
Action space A = {0, 1}, representing {left, right}
State space (S1, S2, S3, S4) ∈ R^4, representing (position,
velocity, angle, angular velocity)
Goal: keep the pole up for 200 timesteps in each episode (the
episode ends if the angle grows too large or the cart strays too far)
Define reward Rt ∈ {−1, 1}
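A sketch of the agent–environment interaction loop for one episode. The ToyCartPole class is a made-up stand-in with placeholder dynamics, not real cart-pole physics (experiments are typically run against a simulator such as OpenAI Gym's CartPole), and the policy is random:

```python
import random

# Stand-in environment with a CartPole-like reset/step interface.
class ToyCartPole:
    def reset(self):
        self.t = 0
        return (0.0, 0.0, 0.0, 0.0)   # (position, velocity, angle, angular velocity)

    def step(self, action):           # action in {0, 1} = {left, right}
        self.t += 1
        state = (0.0, 0.0, 0.0, 0.0)  # placeholder next state
        reward = 1 if self.t < 200 else -1   # +1 while surviving, -1 at the end
        done = self.t >= 200
        return state, reward, done

env = ToyCartPole()
state, done, total = env.reset(), False, 0
while not done:
    action = random.choice([0, 1])           # random policy for illustration
    state, reward, done = env.step(action)
    total += reward
print(total)  # 199 * 1 + (-1) = 198
```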
Application to CartPole Problem
RL in Laber Labs
At Laber Labs we apply reinforcement learning to interesting
and important real-world problems
Controlling the spread of disease
Dynamic medical treatment
Education
Sports decision-making
Stopping the spread of disease
Figure 6: The spread of white-nose syndrome in bats, 2006-2014.
States: which locations are infected
Actions: locations to treat
Rewards: number of uninfected locations
Space Mice
Figure 7: Space Mice (by Laber Labs’ Marshall Wang).
Dynamic medical treatment
Figure 8: RL can help us customize medical treatment to individual patients’ characteristics.
States: current health status (exercise levels, food intake, blood pressure, blood sugar, and more)
Actions: recommended treatment
Rewards: health outcomes