NILOOFAR SEDIGHIAN BIDGOLI
MACHINE LEARNING COURSE
CS DEPARTMENT, SBU UNIVERSITY
JUNE 2020, TEHRAN, IRAN
When it is not in our power to determine
what is true, we ought to act in accordance
with what is most probable.
- Descartes
That thing is a “double bacon cheeseburger.”
That thing is like this
other thing
Eat that thing because it
tastes good and will keep
you alive longer
Deep reinforcement learning is
about how we make decisions
To tackle decision-making problems under uncertainty
Two core components in an RL system
 Agent: represents the “solution”
 A computer program whose single role is to make decisions that solve complex decision-making problems under uncertainty.
 Environment: the representation of the “problem”
 Everything that comes after the Agent's decision.
Notations:
 State = s = x
 Action = control = a = u
 Policy π(a|s) is defined as a probability distribution over actions, not as a concrete action
 Like the weights in a deep learning model, it is parameterized by θ
 Gamma (γ): we discount rewards, lowering their estimated value the further they lie in the future (see the sketch after this list)
 Human intuition: “In the long run, we are all dead.”
 If γ is 1: we care about all rewards equally
 If γ is 0: we care only about the immediate reward
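A minimal sketch (my own illustration, not from the slides) of how γ discounts future rewards:

# Discounted return: gamma = 1 weights all rewards equally,
# gamma = 0 keeps only the immediate reward.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, 1.0))   # 4.0  - all rewards count equally
print(discounted_return(rewards, 0.0))   # 1.0  - only the immediate reward
print(discounted_return(rewards, 0.9))   # 1 + 0.9 + 0.81 + 0.729 = 3.439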
Policy
Intuition: why humans?
 If you are the agent, the environment could be the laws of physics and the rules of society that process your actions and determine their consequences.
Were you ever in the wrong place at the wrong time?
That’s a state
There is no training data here
 Like humans learning how to live (and survive!) as a kid
 By trial and error
 With positive or negative rewards
 Reward and punishment method
Google's artificial intelligence company, DeepMind, has developed an AI that has managed to learn how to walk, run, jump, and climb without any prior guidance. The result is as impressive as it is goofy.
Watch Video
Google
DeepMind
Learning to play Atari
Watch Video
Reward vs Value
 Reward is an immediate signal received in a given state, while value is the sum of all the rewards you can expect to collect from that state onward.
 Value is a long-term expectation, while reward is an immediate pleasure.
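To make the distinction concrete, here is a small sketch (an illustration of mine, not from the slides) in which the reward is a single immediate number, while the value of a state is estimated as the average of many sampled returns from that state:

import random

# Hypothetical toy setting: each visit to state s yields a random
# immediate reward; the value is the long-run expectation of returns.
def immediate_reward(s):
    return random.choice([0.0, 1.0])          # the "immediate pleasure"

def sample_return(s, gamma=0.9, horizon=20):
    return sum(gamma**k * immediate_reward(s) for k in range(horizon))

def estimate_value(s, n_episodes=1000):
    # Value = expected return, estimated by averaging many sampled returns.
    return sum(sample_return(s) for _ in range(n_episodes)) / n_episodes

print(estimate_value("s0"))   # roughly 0.5 * (1 - 0.9**20) / 0.1 ≈ 4.4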
Return
Tasks
 Natural ending: episodic tasks -> games
 Episode: sequence of time steps
 The sum of rewards collected in a single episode is called a return. Agents are
often designed to maximize the return.
 Without natural ending: continuing tasks -> learning forward motion
How the environment reacts to certain actions is defined by a model, which may or may not be known by the Agent.
Approaches
 Evaluate how good it is to reach a certain state or take a specific action (i.e. value learning)
 Measures the total rewards that you get from a particular state following a specific policy
 Go cheat sheet
 Uses the V or Q value to derive the optimal policy
 Q-learning
 Use the model to find the actions that have the maximum rewards (model-based learning)
 Model-based RL uses the model and the cost function to find the optimal path
 Derive a policy directly to maximize rewards (policy gradient)
 Actions with better rewards are made more likely to happen (and vice versa).
For model-based learning,
watch this →
Watch Video
RL:
exploit and explore
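The slide only names the trade-off; one common way to balance exploitation and exploration (shown here as my own illustrative sketch, not something stated on the slide) is epsilon-greedy action selection:

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon explore a random action,
    # otherwise exploit the action with the highest estimated Q value.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.2, 0.8, 0.5]
print(epsilon_greedy(q_values))   # usually 1, occasionally a random action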
How can we
mathematically formalize
the RL problem?
• MARKOV DECISION PROCESSES FORMALIZE THE REINFORCEMENT
LEARNING PROBLEM
• Q-LEARNING AND POLICY GRADIENTS ARE TWO MAJOR FAMILIES OF
ALGORITHMS IN THIS AREA
MDP
 An attempt to model a complex probability distribution of rewards over a very large number of state-action pairs
 A Markov decision process gives us a way to sample from this complex distribution and infer its properties, even when we do not understand the mechanism by which states, actions, and rewards relate (a toy sketch follows)
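As a rough sketch of what sampling from such a distribution means, here is a toy MDP I made up for illustration (all states, actions, and numbers are invented):

import random

# Toy MDP: for each (state, action) pair, a distribution over
# (next_state, reward) outcomes with their probabilities.
transitions = {
    ("s0", "left"):  [(("s0", 0.0), 0.8), (("s1", 1.0), 0.2)],
    ("s0", "right"): [(("s1", 1.0), 0.9), (("s0", 0.0), 0.1)],
    ("s1", "left"):  [(("s0", 0.0), 1.0)],
    ("s1", "right"): [(("s1", 2.0), 1.0)],
}

def step(state, action):
    # Sample one outcome according to its probability.
    outcomes, probs = zip(*transitions[(state, action)])
    (next_state, reward), = random.choices(outcomes, weights=probs, k=1)
    return next_state, reward

print(step("s0", "right"))   # e.g. ('s1', 1.0)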
MDP
• Genes on a chromosome are
states. To read them (and create
amino acids) is to go through
their transitions
• Emotions are states in a
psychological system. Mood
swings are the transitions.
Markov chains have a particular property: memorylessness (oblivion, forgetting).
They assume the entirety of the past is encoded in the present state.
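A tiny sketch (my own illustration, echoing the mood-swing example) of the memoryless property: the next state depends only on the current state, never on how we got there:

import random

# Markov chain over moods; each row gives the distribution of the next state.
P = {
    "calm":    {"calm": 0.7, "excited": 0.3},
    "excited": {"calm": 0.4, "excited": 0.6},
}

def next_state(state):
    states = list(P[state])
    return random.choices(states, weights=[P[state][s] for s in states])[0]

state, path = "calm", ["calm"]
for _ in range(5):
    state = next_state(state)   # only the current mood is consulted
    path.append(state)
print(path)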
Q-learning
Q stands for the "quality" of an action taken in a given state
 Q-learning is a model-free reinforcement learning algorithm to learn a
policy telling an agent what action to take under what circumstances.
 For any finite Markov decision process (FMDP), Q-learning finds an optimal
policy in the sense of maximizing the expected value of the total reward
over any and all successive steps, starting from the current state.
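A minimal tabular Q-learning update (a generic sketch of mine, not code from the course), using the standard rule Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)]:

from collections import defaultdict

Q = defaultdict(float)          # Q[(state, action)] starts at 0
alpha, gamma = 0.1, 0.99

def q_update(state, action, reward, next_state, actions):
    # Bootstrapped target: immediate reward plus discounted best next value.
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

q_update("s0", "right", 1.0, "s1", actions=["left", "right"])
print(Q[("s0", "right")])   # 0.1 after one update from zero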
Q
A value for each state-action pair is called
the action-value function, also known as the Q-function.
It is usually denoted by Q^π(s, a) and refers to the
expected return G when the Agent is at state s,
takes action a, and then follows the policy π.
Break
Westworld…
Creation of Adam, 1508-1512
Bellman Equation
It writes the "value" of a decision problem at a
certain point in time in terms of the payoff from
some initial choices and the "value" of the
remaining decision problem that results from
those initial choices.
This means that if we know the value of s_{t+1}, we can very easily calculate the value of s_t.
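In standard notation (my addition, not written on the slide), the Bellman equation for the state-value function under a policy π reads:

V^{\pi}(s) = \mathbb{E}_{\pi}\left[ r_{t+1} + \gamma \, V^{\pi}(s_{t+1}) \mid s_t = s \right]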
Iteration Phase:
DQN
Deep Q-network
Using a deep neural network to estimate the Q-function
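A minimal sketch of such a Q-network in PyTorch (an illustration under my own assumptions about state and action sizes, not the architecture used on the slides):

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Maps a state vector to one estimated Q value per action.
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
state = torch.randn(1, 4)
print(q_net(state))          # one Q value per action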
Experience Replay
Experience replay stores the last million or so (state, action, reward, next state) transitions in a replay buffer. We train Q with batches of random samples from this buffer (a minimal buffer sketch follows the list).
 It enables the RL agent to sample from and train on previously observed data offline
 It massively reduces the number of interactions needed with the environment
 Batches of experience can be sampled, reducing the variance of learning updates
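A minimal replay-buffer sketch (my own illustration) using only the standard library:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        # Oldest transitions are dropped automatically once the buffer is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random batches break the correlation between consecutive steps.
        return random.sample(self.buffer, batch_size)

buffer = ReplayBuffer()
for t in range(100):
    buffer.push(t, 0, 1.0, t + 1, False)
batch = buffer.sample(32)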
Experience!
REINFORCE rule
= an estimator of the policy gradient
We change the policy in the direction of the steepest reward increase.
This means that actions with better rewards are made more likely to happen.
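A sketch of the REINFORCE gradient estimator in PyTorch (a generic illustration assuming hypothetical log-probabilities and returns collected during one episode, not the exact formulation on the following slides):

import torch

# Hypothetical values gathered during one episode.
log_probs = torch.tensor([-0.2, -1.5, -0.7], requires_grad=True)
returns   = torch.tensor([ 3.0,  1.0,  0.5])

# REINFORCE: increase the log-probability of actions in proportion to
# the return that followed them (by minimizing the negative weighted sum).
loss = -(log_probs * returns).sum()
loss.backward()
print(log_probs.grad)     # gradient estimate used to update the policy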
Actor-critic set-up:
The “actor” (policy) learns by using feedback from the “critic” (value function).
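A conceptual sketch of one actor-critic update (my own illustration of the idea; the concrete algorithms on the surrounding slides may differ):

import torch

# Hypothetical quantities from a single transition.
log_prob = torch.tensor(-0.7, requires_grad=True)   # actor: log pi(a|s)
value    = torch.tensor(1.2, requires_grad=True)    # critic: V(s)
reward, gamma, next_value = 1.0, 0.99, torch.tensor(1.5)

# The critic's TD error plays the role of the feedback ("advantage").
td_error = reward + gamma * next_value - value
actor_loss  = -log_prob * td_error.detach()   # push the policy toward good actions
critic_loss = td_error.pow(2)                 # push V(s) toward the TD target
(actor_loss + critic_loss).backward()
print(log_prob.grad, value.grad)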
So…
Questions
Sophia, from 2016
Thank you