NILOOFAR SEDIGHIAN BIDGOLI
MACHINE LEARNING COURSE
CS DEPARTMENT, SBU UNIVERSITY
JUNE 2020, TEHRAN, IRAN
When it is not in our power to determine
what is true, we ought to act in accordance
with what is most probable.
- Descartes
That thing is a
“double bacon
cheeseburger”
That thing is like this
other thing
Eat that thing because it
tastes good and will keep
you alive longer
Deep reinforcement learning is
about how we make decisions
To tackle decision-making problems under uncertainty
Two core components in an RL system
 Agent: represents the “solution”
 A computer program with a single role: making decisions to solve complex
decision-making problems under uncertainty.
 Environment: the representation of the “problem”
 Everything that comes after the Agent's decision.
Notation:
 State = s = x
 Action = control = a = u
 Policy π(a|s) is defined as a probability distribution over actions, not as a concrete action
 Like the weights in a deep learning model, it is parameterized by θ
 Gamma (γ): we discount future rewards, lowering their estimated value the further they lie in the future
 Human intuition: “In the long run, we are all dead.”
 If γ = 1: we care about all rewards equally
 If γ = 0: we care only about the immediate reward
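A minimal sketch (not from the slides) of how γ shapes the return; the rewards and γ values below are made up for illustration:

# Discounted return: G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):      # accumulate from the last reward backwards
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 0.0, 10.0]               # hypothetical rewards from one episode
print(discounted_return(rewards, 1.0))        # γ = 1: all rewards count equally
print(discounted_return(rewards, 0.0))        # γ = 0: only the immediate reward matters
print(discounted_return(rewards, 0.9))        # in between: later rewards are discounted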
Policy
Intuition: why humans?
 If you are the agent, the environment could be the laws of physics and the
rules of society that process your actions and determine their consequences.
Were you ever in the wrong place at the wrong time?
That’s a state
There is no training data here
 Like humans learning how to live (and survive!) as a kid
 By trial and error
 With positive or negative rewards
 Reward and punishment method
Google's artificial intelligence company, DeepMind, has developed an AI that
managed to learn how to walk, run, jump, and climb without any prior
guidance. The result is as impressive as it is goofy.
Watch Video
Google
DeepMind
Learning to play Atari
Watch Video
Reward vs Value
 Reward is the immediate signal received in a given state, while value is the
expected sum of all future rewards from that state.
 Value is a long-term expectation, while reward is an immediate pleasure.
Return
Tasks
 Natural ending: episodic tasks -> games
 Episode: sequence of time steps
 The sum of rewards collected in a single episode is called a return. Agents are
often designed to maximize the return.
 Without natural ending: continuing tasks -> learning forward motion
How the environment reacts to
certain actions is defined by a model,
which may or may not be known by
the Agent.
Approaches
 Analyze how good it is to reach a certain state or take a specific action (i.e.
value learning)
 Measures the total reward you get from a particular state when following a
specific policy
 Go cheat sheet
 Uses the V or Q value to derive the optimal policy
 Q-learning
 Use the model to find the actions with the maximum reward (model-based
learning)
 Model-based RL uses the model and the cost function to find the optimal path
 Derive a policy directly to maximize rewards (policy gradient)
 For actions with better rewards, we make them more likely to happen (or vice versa).
For model-based learning
Watch this →
Watch Video
RL: exploit and explore
How can we mathematically formalize the RL problem?
• MARKOV DECISION PROCESSES FORMALIZE THE REINFORCEMENT
LEARNING PROBLEM
• Q-LEARNING AND POLICY GRADIENTS ARE TWO MAJOR
ALGORITHMS IN THIS AREA
MDP
 An attempt to model a complex probability distribution of rewards in relation
to a very large number of state-action pairs
 A Markov decision process is a way to sample from a complex distribution to
infer its properties, even when we do not understand the mechanism by
which states, actions, and rewards relate
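A minimal sketch (my own illustration, not from the slides) of an MDP written out as data, with made-up states, actions, transition probabilities, and rewards:

import random

# Hypothetical 2-state MDP: transitions[state][action] = [(next_state, probability, reward), ...]
transitions = {
    "s0": {"stay": [("s0", 0.9, 0.0), ("s1", 0.1, 1.0)],
           "move": [("s1", 1.0, 0.0)]},
    "s1": {"stay": [("s1", 1.0, 2.0)],
           "move": [("s0", 1.0, 0.0)]},
}

def step(state, action):
    """Sample the next state and reward from the transition model."""
    outcomes = transitions[state][action]
    i = random.choices(range(len(outcomes)), weights=[p for _, p, _ in outcomes])[0]
    next_state, _, reward = outcomes[i]
    return next_state, reward

print(step("s0", "stay"))   # e.g. ('s0', 0.0) most of the time, ('s1', 1.0) occasionally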
MDP
• Genes on a chromosome are
states. To read them (and create
amino acids) is to go through
their transitions.
• Emotions are states in a
psychological system. Mood
swings are the transitions.
Markov chains have a particular property:
oblivion, or forgetting.
They assume the entirety of the past is encoded in
the present state.
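Stated formally (a standard formulation, not spelled out on the slide), the Markov property says the next state depends only on the current state and action, not on the rest of the history:

P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0)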
Q-learning
The "quality" of an action taken in a given state
 Q-learning is a model-free reinforcement learning algorithm that learns a
policy telling the agent what action to take under what circumstances.
 For any finite Markov decision process (FMDP), Q-learning finds an optimal
policy in the sense of maximizing the expected value of the total reward
over all successive steps, starting from the current state.
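A minimal tabular Q-learning sketch, assuming a hypothetical env object with a reset() method, a step(action) method returning (next_state, reward, done), and an actions list; the hyperparameters are made up:

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)                       # Q[(state, action)], defaults to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability epsilon, otherwise exploit
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q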
Q
A value for each state-action pair, which is called
the action-value function, also known as the Q-function.
It is usually denoted by Qπ(s, a) and refers to the
expected return G when the Agent is in state s and
takes action a, following the policy π.
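In symbols (a standard definition, consistent with the notation above):

Qπ(s, a) = E[ G_t | s_t = s, a_t = a ] = E[ r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … | s_t = s, a_t = a ]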
Break
Westworld…
Creation of Adam, 1508-1512
Bellman Equation
It writes the "value" of a decision problem at a
certain point in time in terms of the payoff from
some initial choices and the "value" of the
remaining decision problem that results from
those initial choices.
In other words, if we know the value of s_{t+1}, we can very easily calculate the value of s_t.
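One standard form (the Bellman optimality equation for the state value, not written out on the slide):

V(s_t) = max_a [ R(s_t, a) + γ · Σ_{s_{t+1}} P(s_{t+1} | s_t, a) · V(s_{t+1}) ]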
Iteration Phase:
DQN
Deep Q-network
Using a deep network to estimate Q
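A minimal sketch of such a Q-network using PyTorch (an assumed library choice; the state dimension and action count are made up), mapping a state vector to one Q-value per discrete action:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

q = QNetwork()
state = torch.randn(1, 4)                 # a fake state, just for illustration
greedy_action = q(state).argmax(dim=1)    # act greedily with respect to the estimated Q-values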
Experience Replay
Experience replay stores the last million state-
action-reward transitions in a replay buffer. We train Q with
batches of random samples from this buffer.
 Enables the RL agent to sample from and train on previously observed data offline
 Massively reduces the number of interactions needed with the environment
 Batches of experience can be sampled, reducing the variance of learning updates
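A minimal replay-buffer sketch (capacity and batch size are made up; the stored tuple also includes the next state and done flag, as is common in DQN implementations):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform random sampling breaks the correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)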
Experience!
REINFORCE rule
= an estimator of the policy gradient
We change the policy in the direction of the steepest reward increase.
That is, for actions with better rewards, we make them more likely to happen.
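The standard REINFORCE gradient estimator (consistent with the policy π_θ and return G defined earlier): increase the log-probability of actions that were followed by high returns.

∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) · G_t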
Actor-critic setup:
The “actor” (policy) learns by using feedback
from the “critic” (value function).
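In a common formulation (not spelled out on the slide), the critic's value estimate serves as a baseline, and the actor follows a policy gradient weighted by the resulting advantage:

A(s_t, a_t) = r_t + γ·V(s_{t+1}) − V(s_t)
∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) · A(s_t, a_t)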
So…
Questions
Sophia, from 2016 on
Thank you