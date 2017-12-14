
ㅣMachine Learning
Definition of Machine Learning
• Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed.
Arthur Samuel (1959)
ㅣMachine Learning
There are 3 types of Machine Learning algorithms:
• Supervised Learning: labeled data, direct feedback, predict outcome/future
• Unsupervised Learning: no labels, no feedback, “find hidden structure”
• Reinforcement Learning: decision process, reward system, learn a series of actions
ㅣMachine Learning
Supervised Learning
Given input variables and an output variable, you use a supervised learning algorithm to learn the mapping function from the input to the output.
Inputs x1, x2, x3, …, xn → y = f(x)
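A minimal sketch of learning a mapping y = f(x) from labeled examples. It assumes a single input variable and a straight-line model fitted by least squares; the function name `fit_line` and the data are hypothetical and only illustrate the idea.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ≈ w * x + b from labeled examples (x, y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


# Hypothetical labeled data, generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(xs, ys)


def f(x):
    """The learned mapping from input to output."""
    return w * x + b
```

Once `w` and `b` are learned, `f` can be applied to inputs that were not in the training data.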
ㅣMachine Learning
Supervised Learning
Source : Coursera ML class - week01 introduction
Linear Regression
• Predicts a future value
• Output variable = real value
• e.g. Dollars ($)
Classification
• Answers YES or NO
• Output variable = category
• e.g. Red or Blue, Disease or No Disease
ㅣMachine Learning
Unsupervised Learning
Given input data, unsupervised learning helps to model the underlying structure or distribution in the data in
order to learn more about the data.
Inputs x1, x2, x3, …, xn → ???
ㅣMachine Learning
Unsupervised Learning
Clustering
• Discover inherent groupings in the data
• e.g. Customers’ purchasing behaviour
Association
• Rule-based machine learning method for discovering relations between variables in large data sets
• If → then
• e.g. Which item do customers purchase after purchasing milk?
Source : Coursera ML class - week01 introduction Source : Market Basket Analysis in Retail
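As a rough illustration of an "if → then" association rule, the sketch below estimates the confidence of the rule "milk → bread" from a handful of shopping baskets. The baskets and the helper name `confidence` are hypothetical, and the sketch treats items bought in the same basket as related, which simplifies the "after purchasing" wording in the example.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    both = sum(1 for t in with_antecedent if consequent in t)
    return both / len(with_antecedent)


# Hypothetical baskets
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "cereal"},
    {"bread", "eggs"},
]
conf_milk_bread = confidence(baskets, "milk", "bread")  # 2 of the 3 milk baskets
```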
ㅣMachine Learning
Reinforcement Learning
Machines learn by trial and error → rule of thumb
• Learn from an environment
ㅣMarkov Decision Process
Def. A Markov Decision Process is a tuple $\langle S, A, P, R, \gamma \rangle$
• $S$ is a finite set of states
• $A$ is a finite set of actions
• $P$ is a state transition probability matrix, $P^a_{ss'} = \mathrm{Prob}[S_{t+1} = s' \mid S_t = s, A_t = a]$
• $R$ is a reward function, $R^a_s = \mathrm{Exp}[R_{t+1} \mid S_t = s, A_t = a]$
• $\gamma$ is a discount factor, $\gamma \in [0, 1]$
A Markov Decision Process is a Markov reward process with decisions. It is an environment in which all
states are Markov.
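One possible way to hold the five components of the tuple in code is sketched below. The class name `MDP`, the dictionary layout for P and R, and the two-state example are all assumptions made for illustration, not part of the definition.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: list   # S: finite set of states
    actions: list  # A: finite set of actions
    P: dict        # P[(s, a)] -> {next_state: probability}
    R: dict        # R[(s, a)] -> expected immediate reward
    gamma: float   # discount factor in [0, 1]

    def validate(self):
        """Check the discount range and that each P[(s, a)] is a distribution."""
        assert 0.0 <= self.gamma <= 1.0
        for (s, a), dist in self.P.items():
            assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)
        return True


# Hypothetical two-state example
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    P={
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "go"): {"s0": 0.1, "s1": 0.9},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"): {"s0": 1.0},
    },
    R={("s0", "stay"): 0.0, ("s0", "go"): 0.0,
       ("s1", "stay"): 1.0, ("s1", "go"): 0.0},
    gamma=0.9,
)
```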
ㅣMarkov Decision Process
State & Action
Source : Fundamental of Reinforcement Learning
ㅣMarkov Decision Process
Markov Chain
• The next state depends ONLY on the current state.
• State diagram
• State transition probability matrix
Source : Fundamental of Reinforcement Learning
[State diagram: Class 1, Class 2, Class 3, Pass, Pub, Facebook, Sleep]
Transition Matrix (rows: current state; columns: next state)
          Class1  Class2  Class3  Pass  Facebook  Pub  Sleep
Class1    0       0.5     0       0     0.5       0    0
Class2    0       0       0.8     0     0         0    0.2
Class3    0       0       0       0.6   0         0.4  0
Pass      0       0       0       0     0         0    1.0
Facebook  0.1     0       0       0     0.9       0    0
Pub       0.2     0.4     0.4     0     0         0    0
Sleep     0       0       0       0     0         0    0
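The transition matrix can be used to sample a sequence of states from the chain. The sketch below copies the matrix values, reads rows and columns in the order Class1, Class2, Class3, Pass, Facebook, Pub, Sleep, and assumes that an all-zero row (Sleep) marks a terminal state; the function name `sample_episode` is hypothetical.

```python
import random

STATES = ["Class1", "Class2", "Class3", "Pass", "Facebook", "Pub", "Sleep"]
P = [
    [0,   0.5, 0,   0,   0.5, 0,   0],    # Class1
    [0,   0,   0.8, 0,   0,   0,   0.2],  # Class2
    [0,   0,   0,   0.6, 0,   0.4, 0],    # Class3
    [0,   0,   0,   0,   0,   0,   1.0],  # Pass
    [0.1, 0,   0,   0,   0.9, 0,   0],    # Facebook
    [0.2, 0.4, 0.4, 0,   0,   0,   0],    # Pub
    [0,   0,   0,   0,   0,   0,   0],    # Sleep (assumed terminal)
]


def sample_episode(start, rng, max_steps=1000):
    """Follow the chain from `start` until a terminal state or max_steps."""
    path = [start]
    i = STATES.index(start)
    for _ in range(max_steps):
        if sum(P[i]) == 0:  # assumption: no outgoing transitions means terminal
            break
        i = rng.choices(range(len(STATES)), weights=P[i])[0]
        path.append(STATES[i])
    return path


episode = sample_episode("Class1", random.Random(0))
```

Each step looks only at the row of the current state, which is exactly the Markov property.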
ㅣMarkov Decision Process
Markov Chain
• State diagram
Source : Fundamental of Reinforcement Learning
ㅣMarkov Decision Process
Reward
Def. $R$ is a reward function, $R^a_s = \mathrm{Exp}[R_{t+1} \mid S_t = s, A_t = a]$
A Markov Decision Process is a Markov Reward Process with decisions. It is an environment in which all
states are Markov.
ㅣMarkov Decision Process
Discount factor
Def. $\gamma$ is a discount factor, $\gamma \in [0, 1]$
It’s reasonable to maximize the sum of rewards.
It’s also reasonable to prefer rewards now to rewards later.
Successive rewards are weighted by $1, \gamma, \gamma^2, \dots$
The value of rewards decays exponentially.
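To make the decay concrete, the sketch below sums a finite reward sequence with weights 1, γ, γ², and so on. The helper name `discounted_return` and the reward values are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]                  # hypothetical reward sequence
g_half = discounted_return(rewards, 0.5)   # 1 + 0.5 + 0.25
g_one = discounted_return(rewards, 1.0)    # no discounting: plain sum
g_zero = discounted_return(rewards, 0.0)   # only the first reward counts
```

With γ close to 0 the agent is short-sighted; with γ close to 1 later rewards count almost as much as immediate ones.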
ㅣMarkov Decision Process
Policy
Def. A policy 𝜋 is a distribution over actions given states
• $\pi(a \mid s) = \mathrm{Prob}[A_t = a \mid S_t = s]$
A “policy” determines which action to take in a given state at a given time.
ㅣMarkov Decision Process
Value function
• State-value function
Def. The return $G_t$ is the total discounted reward from time-step $t$.
• $G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
Def. The state-value function $v(s)$ of a Markov Reward Process is the expected return starting from state $s$.
• $v(s) = \mathrm{Exp}[G_t \mid S_t = s]$
The value function 𝑣(𝑠) gives the long-term value of the state 𝑠.
ㅣMarkov Decision Process
Value function
• State-value function for policy
Def. The state-value function $v_\pi(s)$ of a Markov Decision Process is the expected return starting from state $s$, and then following policy $\pi$.
$v_\pi(s) = \mathrm{Exp}_\pi[G_t \mid S_t = s]$
The state-value function differs from one policy to another.
Since we need to find the policy which maximizes its value function, the state-value function plays an important role in reinforcement learning.
ㅣMarkov Decision Process
Value function
• Action-value function
Def. The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$.
$q_\pi(s, a) = \mathrm{Exp}_\pi[G_t \mid S_t = s, A_t = a]$
This is the expected value of return when an action is taken in a certain state.
ㅣBellman Equation
Bellman Expectation Equation
• Bellman equation for value function
The value function can be decomposed into two parts:
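• immediate reward $R_{t+1}$
• discounted value of successor state $\gamma v(S_{t+1})$
$v(s) = \mathrm{Exp}[G_t \mid S_t = s]$
$= \mathrm{Exp}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$
$= \mathrm{Exp}[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) \mid S_t = s]$
$= \mathrm{Exp}[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$
$= \mathrm{Exp}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$
The action-value function can similarly be decomposed,
$q_\pi(s, a) = \mathrm{Exp}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$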
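The decomposition into immediate reward plus discounted successor value suggests a simple way to compute values: start from zero and apply the backup repeatedly. The sketch below does this for a hypothetical two-state chain in which the policy is already fixed, so each state has one expected reward `R[s]` and one transition distribution `P[s]`; the names and numbers are assumptions for illustration.

```python
def evaluate(P, R, gamma, sweeps=500):
    """Repeatedly apply v(s) <- R[s] + gamma * sum over s' of P[s][s'] * v(s')."""
    v = {s: 0.0 for s in P}
    for _ in range(sweeps):
        v = {
            s: R[s] + gamma * sum(p * v[s2] for s2, p in P[s].items())
            for s in P
        }
    return v


# Hypothetical chain obtained after fixing a policy
P = {"a": {"a": 0.5, "b": 0.5}, "b": {"b": 1.0}}
R = {"a": 1.0, "b": 0.0}  # expected immediate reward when leaving each state
v = evaluate(P, R, gamma=0.5)
```

For this example the fixed point can be checked by hand: v(b) = 0 and v(a) = 1 + 0.25 v(a), so v(a) = 4/3.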
ㅣBellman Equation
Bellman Optimality Equation
• Optimal value function
Def. The optimal state-value function $v_*(s)$ is the maximum value function over all policies
$v_*(s) = \max_\pi v_\pi(s)$
Def. The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies
$q_*(s, a) = \max_\pi q_\pi(s, a)$
• The optimal value function specifies the best possible performance in the Markov Decision Process.
• A Markov Decision Process is “solved” when we know the optimal value function.
ㅣBellman Equation
Bellman Optimality Equation
• Optimal policy
An optimal policy can be found by maximizing over $q_*(s, a)$,
$\pi_*(a \mid s) = \begin{cases} 1, & \text{if } a = \operatorname{argmax}_{a \in A} q_*(s, a) \\ 0, & \text{otherwise} \end{cases}$
• There is always a deterministic optimal policy for any Markov Decision Process.
• If we know $q_*(s, a)$, we immediately have the optimal policy.
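Reading the optimal policy off a table of optimal action values is a one-line argmax per state. The sketch below assumes the values are stored as a nested dictionary; the table `q_star` holds made-up numbers and `greedy_policy` is a hypothetical helper name.

```python
def greedy_policy(q):
    """Put probability 1 on the highest-valued action in each state, 0 elsewhere."""
    policy = {}
    for s, action_values in q.items():
        best = max(action_values, key=action_values.get)  # ties: first one wins
        policy[s] = {a: 1.0 if a == best else 0.0 for a in action_values}
    return policy


# Hypothetical optimal action values
q_star = {
    "s0": {"left": 1.0, "right": 2.5},
    "s1": {"left": 0.3, "right": 0.1},
}
pi_star = greedy_policy(q_star)
```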
ㅣBellman Equation
State Transition Probability Diagram
Source : Fundamental of Reinforcement Learning
ㅣBellman Equation
Bellman Optimality Equation
The optimal value functions are recursively related by the Bellman Optimality Equations:
$v_*(s) = \max_a q_*(s, a)$
$q_*(s, a) = R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} v_*(s')$
$v_*(s) = \max_a \left( R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} v_*(s') \right)$
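One common way to use the Bellman optimality equation is value iteration: treat the equation as an update rule and apply it until the values stop changing. The sketch below is an illustration under assumptions, not a method taken from the slides; the two-state MDP, the fixed number of sweeps, and the name `value_iteration` are all hypothetical.

```python
def value_iteration(states, actions, P, R, gamma, sweeps=200):
    """Apply v(s) <- max_a [ R[(s,a)] + gamma * sum_s' P[(s,a)][s'] * v(s') ]."""
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        v = {
            s: max(
                R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            for s in states
        }
    return v


# Hypothetical deterministic MDP: staying in s1 pays 1, everything else pays 0
states = ["s0", "s1"]
actions = ["stay", "go"]
P = {
    ("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "go"): 0.0}
v_star = value_iteration(states, actions, P, R, gamma=0.5)
```

By hand, staying in s1 forever is worth 1 + 0.5 + 0.25 + ... = 2, and moving from s0 to s1 first is worth 0.5 × 2 = 1.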
ㅣDynamic Programming
Dynamic Programming divides a problem into subproblems, which are themselves usually divided into further subproblems.
A better name for Dynamic Programming might be Recursive Optimization.
e.g. Shortest dipath (directed path) problems
[Figure: directed graph with nodes 1, i, j]
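A sketch of the shortest dipath example as a recursion: the shortest distance from node 1 to node j is the best, over all arcs (i, j), of the shortest distance to i plus the cost of that arc. The code below uses Bellman-Ford style sweeps and assumes there are no negative-cost cycles; the graph, the costs, and the function name `shortest_dipaths` are hypothetical.

```python
import math


def shortest_dipaths(n, arcs, source=1):
    """d[j] = min over arcs (i, j) of d[i] + cost, found by repeated relaxation."""
    d = {j: math.inf for j in range(1, n + 1)}
    d[source] = 0.0
    for _ in range(n - 1):  # n - 1 sweeps suffice without negative cycles
        for i, j, cost in arcs:
            if d[i] + cost < d[j]:
                d[j] = d[i] + cost
    return d


# Hypothetical directed graph: (from node i, to node j, cost)
arcs = [(1, 2, 4.0), (1, 3, 1.0), (3, 2, 2.0), (2, 4, 5.0)]
dist = shortest_dipaths(4, arcs)
```

The subproblem "shortest distance to i" is reused for every arc leaving i, which is the overlap that dynamic programming exploits.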
