Introduction to Deep Reinforcement Learning
By:
Reyhane Akhavan Kharazi
Mohammad Hossein Modirrousta
Types of Machine Learning

Machine Learning
● Supervised: learning a generalized model of data based on labeled examples
● Unsupervised: drawing inferences from an unlabeled set of data
● Reinforcement: an agent learns how to interact with the environment based on experience and the reward it gains
What is Reinforcement Learning (RL)?

[Agent–environment loop: at each step the agent takes an action At; the environment returns a reward Rt+1 and the next state St+1.]
Example of RL
The agent starts from the point (1, 1) and moves toward the goal:
State = (1, 1)
Action = Right
New state = (1, 2)
Reward = -1
Some definitions
Markov Process
● A Markov Process (or Markov Chain) is a stochastic (random) process that satisfies the Markov property.
● The Markov property assumes memorylessness: predictions about the future of the process can be made based only on the current state, without any knowledge of the historical states.
● p(St+1 | S1, …, St) = p(St+1 | St)
Markov Process
● A Markov Process is characterized by:
○ States: the discrete states of the process at any time
○ Transition probabilities: the probability of moving from one state to another

[Example Markov chain over states S0, S1, S2, S3 with the transition probabilities listed below.]

S      S′     P
S0     S1     0.6
S0     S0     0.4
S1     S2     0.5
S1     S3     0.5
S2     S2     0.7
S2     S3     0.3
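A minimal sketch of sampling trajectories from this chain, using the transition table above (S3 is treated as an absorbing/terminal state, which the table leaves implicit):

import random

# Transition probabilities from the table above: P[s] maps next state -> probability.
P = {
    "S0": {"S0": 0.4, "S1": 0.6},
    "S1": {"S2": 0.5, "S3": 0.5},
    "S2": {"S2": 0.7, "S3": 0.3},
    "S3": {"S3": 1.0},  # assumption: S3 is absorbing
}

def sample_trajectory(start="S0", steps=10):
    """Sample a state sequence by repeatedly drawing the next state from P."""
    state, trajectory = start, [start]
    for _ in range(steps):
        next_states, probs = zip(*P[state].items())
        state = random.choices(next_states, weights=probs)[0]
        trajectory.append(state)
    return trajectory

print(sample_trajectory())  # e.g. ['S0', 'S1', 'S3', 'S3', ...]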
Markov Reward Process (MRP)
● A Markov Reward Process (MRP) is a Markov process with a value judgment attached, saying how much reward is accumulated through a particular sequence that we sample.
● An MRP is a tuple (S, P, R, 𝛄):
○ S is a finite set of states
○ P is the transition probability matrix
■ Pss′ = p(St+1 = s′ | St = s)
○ R is a reward function
■ Rs = E[Rt+1 | St = s]
■ It is the immediate reward
○ 𝛄 is a discount factor, 𝛄 ∈ [0, 1]

[Example MRP: the Markov chain above with rewards R = -1 in S0, R = +2 in S1, R = -1 in S2, and R = +5 in S3.]
Return
- Our goal is to maximize the return.
- The return Gt is the total discounted reward from time step t: Gt = Rt+1 + γRt+2 + γ²Rt+3 + …
- The discount factor γ is a value between 0 and 1. A γ closer to 0 leads to short-sighted evaluation, while a value closer to 1 favors far-sighted evaluation.
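A minimal sketch of computing a discounted return from a sequence of rewards (the reward numbers are illustrative):

def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Rewards collected along one sampled trajectory (illustrative numbers).
print(discounted_return([-1, +2, -1, +5], gamma=0.9))  # -1 + 1.8 - 0.81 + 3.645 = 3.635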
State Value Function
The state value function v(s) gives the long-term value of state s. It is the expected return starting from state s.
Value Function

[The example MRP diagram from the previous slide, repeated: states S0–S3 with the transition probabilities shown above and rewards R = -1 (S0), R = +2 (S1), R = -1 (S2), R = +5 (S3).]
Value Iteration (on the example MRP, with γ = 1)

Iteration 0: v(s0) = -1, v(s1) = +2, v(s2) = -1, v(s3) = +5 (the immediate rewards)

Iteration 1:
v(s0) = -1 + 1·(0.4·-1 + 0.6·2) = -0.2
v(s1) = 2 + 1·(0.5·-1 + 0.5·5) = 4
v(s2) = -1 + 1·(0.7·-1 + 0.3·5) = -0.2
v(s3) = +5

Iteration 2:
v(s0) = -1 + 1·(0.4·-0.2 + 0.6·4) = 1.32
v(s1) = 2 + 1·(0.5·-0.2 + 0.5·5) = 4.4
v(s2) = -1 + 1·(0.7·-0.2 + 0.3·5) = 0.36
v(s3) = +5

…

Final iteration (converged values): v(s0) = 3.66, v(s1) = 5.33, v(s2) = 1.66, v(s3) = +5
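A minimal sketch of this value-iteration sweep for the example MRP (γ = 1, with S3 treated as terminal); it reproduces the iterations above:

# Transition probabilities and per-state rewards of the example MRP.
P = {
    "S0": {"S0": 0.4, "S1": 0.6},
    "S1": {"S2": 0.5, "S3": 0.5},
    "S2": {"S2": 0.7, "S3": 0.3},
}
R = {"S0": -1, "S1": 2, "S2": -1, "S3": 5}
gamma = 1.0

# Iteration 0: values start at the immediate rewards; S3 is kept fixed as terminal.
v = dict(R)
for i in range(100):
    v_new = {}
    for s in P:
        # Bellman backup: immediate reward plus discounted expected value of successors.
        v_new[s] = R[s] + gamma * sum(p * v[s_next] for s_next, p in P[s].items())
    v_new["S3"] = R["S3"]
    v = v_new

print({s: round(val, 2) for s, val in v.items()})
# -> roughly {'S0': 3.67, 'S1': 5.33, 'S2': 1.67, 'S3': 5}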
Markov Decision Process (MDP)
● An MDP trajectory can be represented as follows:
s0 → (a0, r1) → s1 → (a1, r2) → s2 → (a2, r3) → ⋯
● An MDP is a tuple (S, A, P, R, 𝛄):
○ S is a finite set of states
○ A is a finite set of actions
○ P is the transition probability matrix
■ Pss′ = p(St+1 = s′ | St = s, At = a)
○ R is a reward function
■ Rs,a = E[Rt+1 | St = s, At = a]
■ It is the immediate reward
○ 𝛄 is a discount factor, 𝛄 ∈ [0, 1]

[Example MDP diagram: states S0–S3, actions a0–a2, transition probabilities (0.5, 0.5, 0.6, 0.4, 1.0), and rewards R = -1, R = -1, R = +2, R = +5.]
Policy
A policy π is a distribution over actions given states. It fully defines the behavior of an agent.
MDP policies depend on the current state and not the history.
Value Function for MDP
The state-value function vπ(s) of an MDP is the expected return starting from state s and then following policy π.
The state-value function tells us how good it is to be in state s when following policy π.
Action Value Function
The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π.
The action-value function tells us how good it is to take a particular action from a particular state. It gives us an idea of which action we should take in each state.
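Written formally (standard definitions, consistent with the text above), in LaTeX:

v_{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right]
q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s,\ A_t = a \right]
v_{\pi}(s) = \sum_{a} \pi(a \mid s)\, q_{\pi}(s, a)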
Ways to solve
There are different ways to solve this problem:
● Policy Iteration, where our focus is to find the optimal policy (model-based)
● Value Iteration, where our focus is to find the optimal value, i.e. the cumulative reward (model-based)
● Q-Learning, where our focus is to find the quality of actions in each state (model-free)
Solving the multi-armed bandit problem
Multi-armed Bandit
● A one-armed bandit is a simple slot machine: you insert a coin into the machine, pull a lever, and get an immediate reward. (In this lecture we assume it is free to test each machine.)
● In the multi-armed bandit problem we have an agent that is allowed to choose actions, and each action returns a reward according to a given, underlying probability distribution. The game is played over many episodes (single actions in this case) and the goal is to maximize the reward.
Exploration & Exploitation
● When we first start playing, we need to play the game and observe the rewards we get
for the various machines. We can call this strategy exploration, since we’re essentially
randomly exploring the results of our actions.
● There is a different strategy we could employ called exploitation, which means that we
use our current knowledge about which machine seems to produce the most rewards.
● Our overall strategy needs to include some amount of exploitation (choosing the best
lever based on what we know so far) and some amount of exploration (choosing
random levers so we can learn more).
Epsilon-greedy strategy
In the epsilon-greedy strategy we choose actions with a mix of exploration and exploitation.
With probability ε we choose an action at random, and the rest of the time (probability 1 − ε) we choose the best lever based on what we currently know from past plays.
Solving the n-armed bandit
# Initialize eps to balance exploration and exploitation
eps = 0.2
for i in range(number_of_iterations):
    if random.random() > eps:
        # Exploitation: choose the best arm according to its average reward
        selected_arm = choose_the_best_arm()
    else:
        # Exploration: select an arm randomly
        selected_arm = random_selection(number_of_arms)
    # Pull the selected arm and get the immediate reward
    immediate_reward = get_reward(selected_arm)
    # Update the mean reward of the selected arm with the new observation
    update_mean_reward(selected_arm, immediate_reward)
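A self-contained sketch of the same loop, with the slide's placeholder helpers (choose_the_best_arm, get_reward, update_mean_reward) filled in with illustrative implementations:

import random

number_of_arms = 5
true_means = [random.uniform(0, 1) for _ in range(number_of_arms)]  # hidden reward means
mean_reward = [0.0] * number_of_arms   # running average reward per arm
pull_count = [0] * number_of_arms      # how many times each arm has been pulled

eps = 0.2
for i in range(1000):
    if random.random() > eps:
        # Exploitation: arm with the highest average reward so far
        arm = max(range(number_of_arms), key=lambda a: mean_reward[a])
    else:
        # Exploration: random arm
        arm = random.randrange(number_of_arms)
    # Pull the arm: reward drawn around its hidden mean
    reward = random.gauss(true_means[arm], 0.1)
    # Incremental update of the arm's average reward
    pull_count[arm] += 1
    mean_reward[arm] += (reward - mean_reward[arm]) / pull_count[arm]

print("best arm estimate:", max(range(number_of_arms), key=lambda a: mean_reward[a]))
print("true best arm:    ", max(range(number_of_arms), key=lambda a: true_means[a]))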
Q-learning
“Q-learning is an off-policy reinforcement learning algorithm that seeks to find the best action to take given the current state. It’s considered off-policy because the q-learning function learns from actions that are outside the current policy, like taking random actions, and therefore a policy isn’t needed. More specifically, q-learning seeks to learn a policy that maximizes the total reward.”
Q-learning
Q(St, At): the model's prediction
Rt+1 + 𝛄 maxa Q(St+1, a): the estimate of the target value
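A minimal sketch of the tabular Q-learning update these two quantities feed into (a standard form; the learning rate α, the number of actions, and the env_step helper are illustrative assumptions):

import random
from collections import defaultdict

alpha, gamma, eps = 0.01, 0.9, 0.1
num_actions = 6  # e.g. actions a0..a5 as in the example that follows
Q = defaultdict(lambda: [0.0] * num_actions)  # Q[state] -> list of action values

def q_learning_step(state, env_step):
    """One Q-learning update; env_step(state, action) -> (next_state, reward, done)."""
    # Epsilon-greedy action selection
    if random.random() < eps:
        action = random.randrange(num_actions)
    else:
        action = max(range(num_actions), key=lambda a: Q[state][a])
    next_state, reward, done = env_step(state, action)
    # Target: immediate reward plus discounted best value of the next state
    target = reward if done else reward + gamma * max(Q[next_state])
    # Move the prediction Q(s, a) toward the target
    Q[state][action] += alpha * (target - Q[state][action])
    return next_state, done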
Q-Learning Example

Assume we are in state s1 and we choose action a2. This action takes us to state s3.
The environment's reward for our action is +4.
Learning rate = 0.01
Discount factor = 0.9

Q =
      a0   a1   a2   a3   a4   a5
s0    12    1    3    1   10    6
s1     0    1    3    0    1    2
s2     8    5    0    1    0    2
s3     0    1    3    9    0   10

Q′(s1, a2) = Q(s1, a2) + 0.01 · [Rt+1 + 0.9 · max Q(s3, a) − Q(s1, a2)] = 3 + 0.01 · [4 + 0.9·10 − 3] = 3.1

Q′ =
      a0   a1   a2   a3   a4   a5
s0    12    1    3    1   10    6
s1     0    1  3.1    0    1    2
s2     8    5    0    1    0    2
s3     0    1    3    9    0   10
Large-scale Reinforcement Learning
● Reinforcement learning can be used to solve large problems
○ Backgammon: 10^20 states
○ Go: 10^70 states
○ Atari games, helicopter control, …
● So far we have mostly considered lookup tables
○ Every state-action pair (s, a) has an entry q(s, a)
● Problem with large MDPs:
○ There are too many states and actions to store in memory
○ It is too slow to learn the value of each state individually
● Solution:
○ We need to approximate the Q function.
Q function
The original Q function accepts a state-action pair and returns the value of that state-action pair—a
single number.
DeepMind used a modified vector-valued Q function that accepts a state and returns a vector of
state-action values, one for each possible action given the input state. The vector-valued Q function
is more efficient, since you only need to compute the function once for all the actions.
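A minimal sketch of such a vector-valued Q network in Keras (the layer sizes, state size, and action count are illustrative assumptions, not the slide's exact architecture):

from tensorflow import keras

state_size, num_actions = 64, 4   # e.g. a flattened 4x4x4 Gridworld state and 4 moves

# Q-network: input is a state vector, output is one Q value per possible action.
model = keras.Sequential([
    keras.layers.Dense(150, activation="relu", input_shape=(state_size,)),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(num_actions, activation="linear"),  # Q values, one per action
])
model.compile(optimizer="adam", loss="mse")

# One forward pass gives Q values for all actions at once:
# q_values = model.predict(state.reshape(1, state_size))   # shape (1, num_actions)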
Deep Q-learning : Building the network
● The last layer will simply produce an output vector of Q values—one for each possible action.
● In this lecture we use the epsilon-greedy approach for action selection.
● Instead of using a static ε value, we will initialize it to a large value and slowly decrement it. This way the algorithm explores and learns a lot in the beginning, but then settles into maximizing rewards by exploiting what it has learned.
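A minimal sketch of such a decaying ε schedule (the start value, floor, and decay rate are illustrative choices):

epsilon, epsilon_min, decay = 1.0, 0.1, 0.995

for episode in range(1000):
    # ... run one episode using epsilon-greedy action selection ...
    # Slowly shift from exploration toward exploitation.
    epsilon = max(epsilon_min, epsilon * decay)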
Gridworld Example
The game board
This is how the Gridworld board is represented as a
numpy array.
Each matrix encodes the position of one of the four
objects: the player, the goal, the pit, and the wall.
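A sketch of what this representation might look like for a 4×4 board (the object positions here are made up for illustration; the slide's actual board image is not reproduced):

import numpy as np

# One 4x4 plane per object; a 1 marks that object's position on the board.
board = np.zeros((4, 4, 4), dtype=int)   # planes: player, goal, pit, wall
board[0, 0, 3] = 1   # player at row 0, column 3
board[1, 3, 0] = 1   # goal at row 3, column 0
board[2, 1, 1] = 1   # pit at row 1, column 1
board[3, 2, 2] = 1   # wall at row 2, column 2

state = board.reshape(1, 64)  # flattened state vector fed to the Q-network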
Neural network as a Q function
Deep Q-learning Algorithm

Initialize the action-value function (the weights of the network) with random weights
For episode = 1, M do:
    Initialize the game and get the starting state s
    For t = 1, T do:
        With probability ε select a random action at; otherwise select at = argmaxa Q(s, a)
        Take action at, and observe the new state s′ and reward rt+1
        Run the network forward using s′ and store the highest Q value, which we call maxQ = maxa Q(s′, a)
        Set the target value:
            target_value = rt+1 + γ · maxQ    if the game continues
            target_value = rt+1               if the game is over
        Train the model with this sample:
            final_target = model.predict(state)
            final_target[action] = target_value
            model.fit(state, final_target)
        s = s′
        If the game is over, break; else continue
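A minimal sketch of the "train the model with this sample" step, assuming a Keras-style model as in the earlier sketch; the state arrays (shape (1, state_size)) and the function name are illustrative:

import numpy as np

gamma = 0.9

def train_on_transition(model, state, action, reward, next_state, done):
    """One DQN-style update on a single transition, matching the pseudocode above."""
    # Target: reward, plus the discounted best predicted value of s' if the game continues.
    max_q = np.max(model.predict(next_state, verbose=0))
    target_value = reward if done else reward + gamma * max_q
    # Keep the network's other action values unchanged; only the taken action's target moves.
    final_target = model.predict(state, verbose=0)
    final_target[0, action] = target_value
    model.fit(state, final_target, epochs=1, verbose=0)

Note that, as in the slide's pseudocode, this basic version trains on one transition at a time; it does not use an experience-replay buffer.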
Double DQN and Dueling DQN
• Double DQN: Decouple selection and evaluation
• Dueling DQN: Split Q-value into advantage function and value function
Classification Markov Decision Process
● A CMDP is a tuple (S, A, P, R):
○ S is the set of training samples
○ A is the set of labels assigned to samples
○ P is the transition probability matrix
■ Pss′ = p(St+1 = s′ | St = s, At = a)
○ R is a reward function:
■ R = +1 when the agent correctly recognizes a label
■ R = -1 otherwise
"Intelligent Fault Diagnosis for Planetary Gearbox Using Time-Frequency Representation and Deep Reinforcement Learning." IEEE/ASME Transactions on Mechatronics (2021).
Summary
● RL is goal-oriented learning based on interaction with an environment.
● Gt is the total discounted reward from time step t. This is what we care about; the goal is to maximize this return.
● The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π.
● The main idea of Q-learning is that your algorithm predicts the value of a state-action pair, and then you compare this prediction to the observed accumulated rewards at some later time and update the parameters of your algorithm, so that next time it will make better predictions.
● There are too many states and actions in large-scale problems, so we cannot exhaustively find the optimal Q-function.
Summary
● In large-scale problems we need to approximate the Q-function, and this can be done using a neural network architecture.
Resources
- https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZBiG_XpjnPrSNw-1XQaM_gB&ab_channel=DeepMind
- https://www.analyticsvidhya.com/blog/2017/01/introduction-to-reinforcement-learning-implementation/
- https://www.youtube.com/playlist?list=PL2-dafEMk2A5FZ-MnPMpp3PBtZcINKwLA
- https://towardsdatascience.com/reinforcement-learning-demystified-markov-decision-processes-part-1-bf00dda41690
- https://towardsdatascience.com/reinforcement-learning-an-introduction-to-the-concepts-applications-and-code-ced6fbfd882d
- https://deeplizard.com/learn/video/QK_PP_2KgGE
- https://astrobear.top/2020/02/23/RLSummary6/
Thank You