Reinforcement Learning Basics
Shusen Wang
Stevens Institute of Technology
http://wangshusen.github.io/
A little bit of probability theory…
Random Variable
• Random variable: unknown; its values depend on the outcomes of random events.
Example (coin flip): random variable X, possible values {0, 1}, random events = the two faces of the coin, with probabilities
ℙ(X = 0) = 0.5,
ℙ(X = 1) = 0.5.
• Uppercase letter 𝑋 for a random variable.
• Lowercase letter 𝑥 for an observed value.
• For example, I flipped a coin 4 times and observed:
• x_1 = 1,
• x_2 = 1,
• x_3 = 0,
• x_4 = 1.
Probability Density Function (PDF)
• PDF provides the relative likelihood that the value of the random variable equals a given sample.
• A PDF describes a continuous distribution.
• Example: Gaussian distribution with mean μ and variance σ².
• PDF: p(x) = 1/√(2πσ²) · exp( −(x−μ)² / (2σ²) ).
(Figure: the bell-shaped probability density curve over x, centered at μ.)
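As a quick illustration (not from the slides), here is a minimal NumPy sketch of this density; the function name gaussian_pdf and the grid are arbitrary choices. Summing the density over a fine grid should come out close to 1, which matches the normalization property discussed a few slides below.

import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density p(x) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Numerical check that the density integrates to roughly 1 on a fine grid.
xs = np.linspace(-10.0, 10.0, 100_001)
dx = xs[1] - xs[0]
print((gaussian_pdf(xs) * dx).sum())   # ~ 1.0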
Probability Mass Function (PMF)
• PMF is a function that gives the probability that a discrete random variable is exactly equal to some value.
Example
• Discrete random variable: X ∈ {1, 3, 7}.
• PMF:
p(1) = 0.2,
p(3) = 0.5,
p(7) = 0.3.
(Figure: bar chart of the probability mass, with bars of height 0.2, 0.5, 0.3 at x = 1, 3, 7.)
Properties of PDF/PMF
• Random variable 𝑋 is in the domain 𝒳.
• For continuous distributions, ∫_𝒳 p(x) dx = 1.
• For discrete distributions, ∑_{x∈𝒳} p(x) = 1.
Expectation
• Random variable X is in the domain 𝒳.
• For continuous distributions, the expectation of f(X) is 𝔼[f(X)] = ∫_𝒳 p(x) · f(x) dx.
• For discrete distributions, the expectation of f(X) is 𝔼[f(X)] = ∑_{x∈𝒳} p(x) · f(x).
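As an illustration (not from the slides), a minimal Python sketch of the discrete case, reusing the PMF p(1) = 0.2, p(3) = 0.5, p(7) = 0.3 from the example above; the helper name expectation is arbitrary.

# PMF from the example: P(X=1)=0.2, P(X=3)=0.5, P(X=7)=0.3.
pmf = {1: 0.2, 3: 0.5, 7: 0.3}

assert abs(sum(pmf.values()) - 1.0) < 1e-12   # probabilities sum to 1

def expectation(pmf, f=lambda x: x):
    """E[f(X)] = sum_x p(x) * f(x) for a discrete distribution."""
    return sum(p * f(x) for x, p in pmf.items())

print(expectation(pmf))                   # E[X]   = 0.2*1 + 0.5*3 + 0.3*7 = 3.8
print(expectation(pmf, lambda x: x ** 2)) # E[X^2]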
Random Sampling
• There are 10 balls in a bin: 2 are red, 5 are green, and 3 are blue.
• Randomly draw a ball from the bin. What is its color?
• Sample a red ball w.p. 0.2, a green ball w.p. 0.5, and a blue ball w.p. 0.3.
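A minimal sketch of this sampling experiment with NumPy (the sample size is an arbitrary choice):

import numpy as np

colors = ["red", "green", "blue"]
probs = [0.2, 0.5, 0.3]

# Draw one ball according to the probabilities above.
print(np.random.choice(colors, p=probs))

# Empirical frequencies over many draws approach the probabilities.
samples = np.random.choice(colors, size=10_000, p=probs)
for c in colors:
    print(c, (samples == c).mean())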
Terminologies
Terminology: state and action
• State s: e.g., the current frame of the game.
• Action a ∈ {left, right, up}.
• Agent: the one who takes the actions (here, Mario).
(Figure: a game frame with the three candidate actions left, right, up drawn as arrows.)
Terminology: policy
• Policy function π: (s, a) ↦ [0, 1]:
π(a | s) = ℙ(A = a | S = s).
• π(a | s) is the probability of taking action A = a given state s, e.g.,
• π(left | s) = 0.2,
• π(right | s) = 0.1,
• π(up | s) = 0.7.
• Upon observing state S = s, the agent's action A can be random.
• Random or deterministic policy? (See the sampling sketch below.)
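As a toy illustration (not the author's code), a stochastic policy over the three actions, with the probabilities hard-coded from the example above; the names policy and sample_action are made up:

import numpy as np

rng = np.random.default_rng()
ACTIONS = ["left", "right", "up"]

def policy(state):
    """A toy stochastic policy pi(.|s): one probability per action.
    Hard-coded here to match the slide's example, independent of the state."""
    return np.array([0.2, 0.1, 0.7])   # pi(left|s), pi(right|s), pi(up|s)

def sample_action(state):
    """Draw A ~ pi(.|state)."""
    return rng.choice(ACTIONS, p=policy(state))

print(sample_action(state=None))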
Terminology: reward
• Collect a coin: R = +1.
• Win the game: R = +10000.
• Touch a Goomba: R = −10000 (game over).
• Nothing happens: R = 0.
Terminology: state transition
• E.g., the "up" action leads to a new state.
• State transition can be random.
• The randomness is from the environment.
• p(s′ | s, a) = ℙ(S′ = s′ | S = s, A = a); see the toy sketch below.
(Figure: old state + action "up" → new state; e.g., one possible new state occurs w.p. 0.8 and another w.p. 0.2.)
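Illustrative only: the state names and the transition table P below are made up, and the 0.8/0.2 split is taken from the slide's figure:

import numpy as np

rng = np.random.default_rng()

# A toy state-transition table p(s'|s, a).
P = {
    ("s0", "up"): {"s1": 0.8, "s2": 0.2},
}

def sample_next_state(s, a):
    """Draw S' ~ p(.|s, a) from the table above."""
    dist = P[(s, a)]
    states, probs = zip(*dist.items())
    return rng.choice(states, p=probs)

print(sample_next_state("s0", "up"))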
Two Sources of Randomness

Randomness in Actions
• Given state s, the action can be random: A ~ π(⋅ | s), e.g.,
• π(left | s) = 0.2,
• π(right | s) = 0.1,
• π(up | s) = 0.7.

Randomness in States
• State transition can be random.
• The environment generates the new state S′ by S′ ~ p(⋅ | s, a), e.g., one new state w.p. 0.8 and another w.p. 0.2.

• The randomness in the action is from the policy function: A ~ π(⋅ | s).
• The randomness in the state is from the state-transition function: S′ ~ p(⋅ | s, a).
Both sources appear in the sketch below.
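Putting the two sources together in one toy sketch (illustrative names and state labels; the 0.2/0.1/0.7 and 0.8/0.2 probabilities come from the slides):

import numpy as np

rng = np.random.default_rng()
ACTIONS = ["left", "right", "up"]

def sample_action(s):
    """Source 1: the action is random, A ~ pi(.|s)."""
    return rng.choice(ACTIONS, p=[0.2, 0.1, 0.7])

def sample_next_state(s, a):
    """Source 2: the next state is random, S' ~ p(.|s, a);
    a toy transition with two possible outcomes (w.p. 0.8 / 0.2)."""
    return rng.choice([f"{s}+{a}:outcome1", f"{s}+{a}:outcome2"], p=[0.8, 0.2])

s = "s0"
a = sample_action(s)               # A ~ pi(.|s)
s_next = sample_next_state(s, a)   # S' ~ p(.|s, a)
print(a, s_next)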
Agent-Environment Interaction
(Figure: the agent observes state s_t from the environment and takes action a_t; the environment then returns the reward r_t and the new state s_{t+1}.)
Play game using AI
• Observe state s_t, select action a_t ~ π(⋅ | s_t), and execute a_t.
• The environment gives the new state s_{t+1} and reward r_t.
• (state, action, reward) trajectory:
s_1, a_1, r_1, s_2, a_2, r_2, ⋯, s_n, a_n, r_n.
• One episode runs from the beginning to the end (Mario wins or dies).
• What is a good policy?
• A good policy leads to a big cumulative reward: ∑_{t=1}^{n} γ^{t−1} · r_t.
• Use the rewards to guide the learning of the policy.
A rollout loop along these lines is sketched below.
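This sketch assumes a Gym-style environment API (env.reset() returns a state, env.step(a) returns (state, reward, done, info)); run_episode, policy, and the discount factor are illustrative choices:

def run_episode(env, policy, gamma=0.99, max_steps=10_000):
    """Roll out one episode and record the (state, action, reward) trajectory."""
    trajectory = []
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)                      # a_t ~ pi(.|s_t)
        s_next, r, done, _ = env.step(a)   # environment gives s_{t+1} and r_t
        trajectory.append((s, a, r))
        s = s_next
        if done:                           # episode ends: win or lose
            break
    # cumulative discounted reward of the whole episode
    total = sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))
    return trajectory, total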
Rewards and Returns
Return
Definition: Return (aka cumulative future reward).
• U_t = R_t + R_{t+1} + R_{t+2} + R_{t+3} + ⋯

Question: At time t, are R_t and R_{t+1} equally important?
• Which of the following do you prefer?
• $100 right now, or $100 one year later?
• $80 right now, or $100 one year later?
• Future reward is less valuable than present reward.
• R_{t+1} should be given less weight than R_t.
Discounted Returns
• γ: discount factor (a tuning hyper-parameter).

Definition: Discounted return (aka cumulative discounted future reward).
• U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + γ³·R_{t+3} + ⋯

Definition: Discounted return (at time t), for an episode that ends at time n.
• U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + ⋯ + γ^{n−t}·R_n.
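A small computation of the discounted return from a list of observed rewards (illustrative; the rewards and γ = 0.9 are made-up numbers):

def discounted_return(rewards, gamma=0.9):
    """u_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... + gamma^(n-t)*r_n,
    computed from the rewards observed from time t to the end of the episode."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Example with made-up rewards:
print(discounted_return([1, 0, 0, 10], gamma=0.9))  # 1 + 0 + 0 + 0.9**3 * 10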
Randomness in Returns
• At the end of the game, we observe all the rewards, r_t, r_{t+1}, r_{t+2}, ⋯, r_n.
• We thereby know the discounted return:
u_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ⋯ + γ^{n−t}·r_n.
• At time t, the rewards R_t, ⋯, R_n are random, so the return U_t is random.
• Reward R_i depends on S_i and A_i.
• States can be random: S_i ∼ p(⋅ | s_{i−1}, a_{i−1}).
• Actions can be random: A_i ∼ π(⋅ | s_i).
• If either S_i or A_i is random, then R_i is random.
• U_t depends on R_t, R_{t+1}, ⋯, R_n.
• ⇒ U_t depends on S_t, A_t, S_{t+1}, A_{t+1}, ⋯, S_n, A_n.
(Figure: timeline from time t to the end of the episode. Before the end, R_t, R_{t+1}, R_{t+2}, ⋯, R_n are random, so U_t is a random variable; after the game ends, r_t, r_{t+1}, r_{t+2}, ⋯, r_n are observed, so u_t is an observed value.)
Value Functions
Action-Value Function Q_π(s, a)

Definition: Discounted return.
• U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + ⋯ + γ^{n−t}·R_n.

Definition: Action-value function.
• Q_π(s_t, a_t) = 𝔼[ U_t | S_t = s_t, A_t = a_t ].

• U_t depends on the states S_t, S_{t+1}, ⋯, S_n and the actions A_t, A_{t+1}, ⋯, A_n.
• Regard s_t and a_t as observed values; regard S_{t+1}, ⋯, S_n and A_{t+1}, ⋯, A_n as random variables.
• The expectation is taken w.r.t. those random variables:
• S_{t+1} ~ p(⋅ | s_t, a_t), ⋯, S_n ~ p(⋅ | s_{n−1}, a_{n−1});
• A_{t+1} ~ π(⋅ | s_{t+1}), ⋯, A_n ~ π(⋅ | s_n).
• Q_π(s_t, a_t) depends on s_t, a_t, π, and p.
• Q_π(s_t, a_t) is independent of S_{t+1}, ⋯, S_n and A_{t+1}, ⋯, A_n (they are integrated out by the expectation).
A naive Monte-Carlo estimate is sketched below.
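It averages observed returns u_t over many episodes that start in state s with first action a and then follow π (an illustration of the definition, not the author's code; env.reset_to(s) is a hypothetical helper, and a classic Gym-style step API is assumed):

def mc_estimate_q(env, policy, s, a, gamma=0.99, n_episodes=1000, max_steps=10_000):
    """Naive Monte-Carlo estimate of Q_pi(s, a) = E[U_t | S_t=s, A_t=a]."""
    total = 0.0
    for _ in range(n_episodes):
        state = env.reset_to(s)   # hypothetical: start the episode in state s
        action = a                # the first action is fixed to a
        u, discount = 0.0, 1.0
        for _ in range(max_steps):
            state, r, done, _ = env.step(action)
            u += discount * r     # accumulate gamma^k * r_{t+k}
            discount *= gamma
            if done:
                break
            action = policy(state)   # later actions follow pi(.|state)
        total += u
    return total / n_episodes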
State-Value Function V_π(s)

Recall: Q_π(s_t, a_t) = 𝔼[ U_t | S_t = s_t, A_t = a_t ], where U_t is the discounted return.

Definition: State-value function.
• V_π(s_t) = 𝔼_A[ Q_π(s_t, A) ], where the expectation is taken w.r.t. the action A ∼ π(⋅ | s_t).
• V_π(s_t) = 𝔼_A[ Q_π(s_t, A) ] = ∑_a π(a | s_t) · Q_π(s_t, a).  (Actions are discrete.)
• V_π(s_t) = 𝔼_A[ Q_π(s_t, A) ] = ∫ π(a | s_t) · Q_π(s_t, a) da.  (Actions are continuous.)
A one-line computation for the discrete case is sketched below.
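For discrete actions, V_π(s) is just a weighted average of Q_π(s, ·) under π(⋅ | s); the action values here are made-up numbers, and the policy probabilities come from the earlier example:

import numpy as np

def state_value(pi_probs, q_values):
    """V_pi(s) = sum_a pi(a|s) * Q_pi(s, a), for a discrete action space.
    Both arguments are arrays indexed by action."""
    return float(np.dot(pi_probs, q_values))

# pi(.|s) = [0.2, 0.1, 0.7]; Q_pi(s, a) values are illustrative only.
print(state_value(np.array([0.2, 0.1, 0.7]), np.array([1.0, 0.5, 2.0])))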
Understanding the Value Functions
• Action-value function: Q_π(s, a) = 𝔼[ U_t | S_t = s, A_t = a ].
• Given policy π, Q_π(s, a) evaluates how good it is for an agent to pick action a while being in state s.
• State-value function: V_π(s) = 𝔼_A[ Q_π(s, A) ].
• For a fixed policy π, V_π(s) evaluates how good the situation is in state s.
• 𝔼_S[ V_π(S) ] evaluates how good the policy π is.
Evaluating Reinforcement Learning
OpenAI Gym
• Gym is a toolkit for developing and comparing reinforcement learning algorithms.
• https://gym.openai.com/
• Classical control problems: Cart Pole, Pendulum.
• Atari games: Pong, Space Invaders, Breakout.
• MuJoCo (continuous control tasks): Ant, Humanoid, Half Cheetah.
Play CartPole Game
• Get the CartPole environment from Gym; "env" provides states and rewards.
• Take a random action at each step.
• "done=1" means the episode is finished (win or lose the game).
• A window pops up rendering CartPole.
• A minimal script is sketched below.
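This sketch assumes the classic Gym API (older Gym versions, roughly 0.25 and earlier), where reset() returns the state and step(a) returns (state, reward, done, info); newer gym/gymnasium versions differ slightly:

import gym

# Get the CartPole environment from Gym; "env" provides states and rewards.
env = gym.make("CartPole-v0")
state = env.reset()

done = False
while not done:
    env.render()                        # a window pops up rendering CartPole
    action = env.action_space.sample()  # a random action
    state, reward, done, info = env.step(action)  # done=1: win or lose the game

env.close()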
Summary

Terminologies
• Agent
• Environment
• State s
• Action a
• Reward r
• Policy π(a | s)
• State transition p(s′ | s, a)

Return and Value
• Return: U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + ⋯
• Action-value function: Q_π(s_t, a_t) = 𝔼[ U_t | s_t, a_t ].
• State-value function: V_π(s_t) = 𝔼_{A∼π}[ Q_π(s_t, A) ].
Play game using AI
(Figure: timeline of play, ⋯, s_{t−1}, a_{t−1}, s_t, with rewards r_{t−2}, r_{t−1} observed so far.)
• At time t, the policy π generates the action a_t from the current state s_t.
• The state transition p then generates the new state s_{t+1}, and the environment gives the reward r_t.
• The process continues at time t+1, and so on.
Thank You!
http://wangshusen.github.io/
