Challenges in Reinforcement Learning
Elias Hasnat
Definition of Reinforcement Learning
Reinforcement learning sits between supervised and unsupervised learning:
1. In supervised learning we have a target label for each training example,
2. In unsupervised learning we have no labels at all,
3. In reinforcement learning we have sparse and time-delayed labels (the rewards).
Based only on those rewards the agent has to learn to behave in the environment.
Challenges to consider in RL problem solving:
1. The credit assignment problem, and
2. The exploration-exploitation dilemma.
Credit assignment is used for defining the strategy; exploration-exploitation is used for improving the reward outcomes.
Credit Assignment
Mathematical relation formalization
● An agent is situated in an environment.
● The environment is in a certain state.
● The agent can perform certain actions in the environment.
● These actions sometimes result in a reward.
● Actions transform the environment and lead to a new state, where the
agent can perform another action, and so on.
● The rules for how the agent chooses its actions are called the policy.
● The environment in general is stochastic, which means the next state may
be somewhat random.
● The set of states and actions, together with rules for transitioning from one
state to another, make up a Markov decision process.
● One episode of this process forms a finite sequence of states, actions and
rewards.
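To make the formalization concrete, here is a minimal sketch of the agent-environment loop, assuming a hypothetical env object whose reset() returns a state and whose step(action) returns (next_state, reward, done); the random policy and the action set are placeholders, not part of the slides.

```python
import random

def run_episode(env, policy, max_steps=1000):
    """Roll out one episode: a finite sequence of (state, action, reward)."""
    trajectory = []
    state = env.reset()                               # environment starts in some state
    for _ in range(max_steps):
        action = policy(state)                        # the policy maps states to actions
        next_state, reward, done = env.step(action)   # stochastic transition plus reward
        trajectory.append((state, action, reward))
        state = next_state
        if done:                                      # terminal state ends the episode
            break
    return trajectory

def random_policy(state, actions=(0, 1, 2, 3)):
    """Placeholder policy: choose uniformly among an illustrative discrete action set."""
    return random.choice(actions)
```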
Reward Strategy
To perform well in the long term, we need to take into account not only the
immediate rewards, but also the future rewards we are going to get.
Total reward for one episode:
R = r_1 + r_2 + r_3 + … + r_n
Total reward from time point t:
R_t = r_t + r_{t+1} + r_{t+2} + … + r_n
But because the environment is stochastic, we can never be sure we will get the same
rewards the next time we perform the same actions. That is why we consider the
discounted future reward:
R_t = r_t + γ r_{t+1} + γ² r_{t+2} + … + γ^(n−t) r_n
γ is the discount factor between 0 and 1; the further a reward lies in the future, the
closer its weight γ^k gets to 0 as the number of steps increases.
Discounted future reward calculation
The discounted future reward at time step t can be expressed in terms of the same
quantity at time step t+1:
R_t = r_t + γ R_{t+1}
If we set the discount factor γ=0, then our strategy will be short-sighted and we
rely only on the immediate rewards.
If we want to balance between immediate and future rewards, we should set
discount factor to something like γ=0.9.
If our environment is deterministic and the same actions always result in the same
rewards, then we can set the discount factor γ=1.
A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward.
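As a quick illustration of the recursion R_t = r_t + γ R_{t+1}, here is a minimal sketch; the reward sequence and γ value are made up for the example.

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute R_t = r_t + gamma * R_{t+1} for every time step, scanning backwards."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future   # the recursion from the slide
        returns[t] = future
    return returns

# Example: sparse, delayed reward only at the end of the episode
print(discounted_returns([0, 0, 0, 1], gamma=0.9))  # approx. [0.729, 0.81, 0.9, 1.0]
```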
The Q-function measures the quality of an action
We define the Q-function Q(s, a) as the maximum discounted future reward when we perform
action a in state s, and continue optimally from that point on:
Q(s_t, a_t) = max R_{t+1}
Suppose you are in state s and wondering whether you should take action a or b. You
want to select the action that results in the highest score at the end of the game.
Once we have the Q-function, the policy π simply picks the action with the highest Q-value:
π(s) = argmax_a Q(s, a)
Now let's focus on just one transition <s, a, r, s’>. We can express the Q-value of
state s and action a in terms of the Q-value of the next state s’:
Q(s, a) = r + γ max_{a'} Q(s', a')
This is called the Bellman equation: the maximum future reward for this state and action
is the immediate reward plus the maximum future reward for the next state.
Q-Learning Formulation
The Q-table is updated iteratively with
Q[s, a] ← Q[s, a] + α (r + γ max_{a'} Q[s', a'] − Q[s, a])
α in the algorithm is the learning rate that controls how much of the difference
between the previous Q-value and the newly proposed Q-value is taken into account.
In particular, when α=1 the two Q[s, a] terms cancel and the update is exactly the
same as the Bellman equation.
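A minimal sketch of this tabular update, assuming a dictionary-backed Q-table; the action set and the α, γ values are illustrative placeholders.

```python
from collections import defaultdict

Q = defaultdict(float)          # Q[(state, action)] defaults to 0
ACTIONS = (0, 1, 2, 3)          # illustrative discrete action set
alpha, gamma = 0.1, 0.9         # example learning rate and discount factor

def q_update(s, a, r, s_next):
    """Apply Q[s,a] <- Q[s,a] + alpha * (r + gamma * max_a' Q[s',a'] - Q[s,a])."""
    best_next = max(Q[(s_next, a_next)] for a_next in ACTIONS)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
```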
Deep Q Network
Neural networks are exceptionally good at coming up with good features for highly
structured data, so we can represent our Q-function with a neural network in one of
two ways.
Use-case 1:
Take the state and action as input and output the corresponding Q-value.
Use-case 2:
Take only the state as input and output the Q-value for each possible action.
* The benefit of the latter use-case over the former is that if we want to perform a Q-value update or pick the action with the highest Q-value, we only have to
do one forward pass through the network and have the Q-values for all actions available immediately.
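A minimal PyTorch sketch of the second use-case; the state dimension, layer sizes and number of actions are arbitrary placeholders, not values from the slides.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action (use-case 2)."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),        # one output head per action
        )

    def forward(self, state):
        return self.net(state)               # shape: (batch, n_actions)

q_net = QNetwork()
state = torch.randn(1, 4)                    # dummy state for illustration
q_values = q_net(state)                      # all Q-values in a single forward pass
best_action = q_values.argmax(dim=1)
```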
Rewriting the Q-function update using the Q-network
Given a transition < s, a, r, s’ >, the Q-table update rule in the previous algorithm
must be replaced with the following:
1. Do a feedforward pass for the current state s to get predicted Q-values for all
actions.
2. Do a feedforward pass for the next state s’ and calculate the maximum over all
network outputs, max_{a'} Q(s’, a’).
3. Set the Q-value target for action a to r + γ max_{a'} Q(s’, a’) (using the max calculated
in step 2). For all other actions, set the Q-value target to the value originally
returned from step 1, making the error 0 for those outputs.
4. Update the weights using backpropagation.
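A sketch of steps 1-4 as one training step, reusing the illustrative QNetwork above; the optimizer, mean-squared-error loss and batched tensors (a as integer action indices, done as 0/1 floats) are assumptions, not part of the slides.

```python
import torch
import torch.nn.functional as F

def dqn_train_step(q_net, optimizer, s, a, r, s_next, done, gamma=0.9):
    """One update for a minibatch of transitions <s, a, r, s'>, following steps 1-4."""
    q_pred = q_net(s)                                     # step 1: Q-values for all actions in s
    with torch.no_grad():
        q_next = q_net(s_next).max(dim=1).values          # step 2: max_a' Q(s', a')
        target_for_a = r + gamma * q_next * (1 - done)    # step 3: target for the taken action
    target = q_pred.detach().clone()
    target[torch.arange(len(a)), a] = target_for_a        # other actions keep their predictions
    loss = F.mse_loss(q_pred, target)                     # error is 0 for the untouched outputs
    optimizer.zero_grad()
    loss.backward()                                       # step 4: backpropagation
    optimizer.step()
    return loss.item()
```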
Experience Replay (Memory)
Estimate the future reward in each state using Q-learning and approximate the Q-
function with a neural network.
Challenge:
The approximation of Q-values using non-linear functions such as neural networks is
not very stable, so training can oscillate or diverge.
Solution: Memorizing the experience.
During gameplay all the experiences < s, a, r, s’ > are stored in a replay memory.
When training the network, random mini batches from the replay memory are used
instead of the most recent transition. This breaks the similarity of subsequent
training samples, which otherwise might drive the network into a local minimum.
Experience replay also makes the training task more similar to usual supervised
learning, which simplifies debugging and testing of the algorithm.
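A minimal replay-memory sketch, assuming a fixed capacity and uniform random sampling; the capacity and batch size are illustrative values.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores <s, a, r, s', done> transitions and samples random minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)    # oldest experiences are evicted automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # breaks correlation between samples
        return list(zip(*batch))                # columns: states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)
```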
Exploration and Exploitation
First, observe that when a Q-table or Q-network is initialized randomly, its
predictions are initially random as well. If we always pick the action with the highest
Q-value, the chosen actions will be random and the agent performs crude “exploration”. As the Q-
function converges, it returns more consistent Q-values and the amount of
exploration decreases. So one could say that Q-learning incorporates
exploration as part of the algorithm. But this exploration is “greedy”: it settles on
the first effective strategy it finds.
A simple and effective fix for this problem is ε-greedy exploration: with
probability ε choose a random action, otherwise go with the “greedy” action that has the
highest Q-value. In practice ε is decreased over time, for example from 1 to 0.1: in the beginning
the system makes completely random moves to explore the state space maximally,
and then it settles down to a fixed exploration rate.
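A sketch of ε-greedy action selection with a linear decay schedule; the decay horizon and the 1 to 0.1 bounds are example values.

```python
import random

def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, decay_steps=100_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step, n_actions):
    """With probability epsilon act randomly, otherwise act greedily on the Q-values."""
    if random.random() < epsilon_by_step(step):
        return random.randrange(n_actions)                        # explore
    return max(range(n_actions), key=lambda a: q_values[a])       # exploit
```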
Deep Q-Learning
Thank You
