Reinforcement Learning
INDHUJA LV
What is Machine Learning?
 Machine learning takes place as a result of
interaction between an agent and the world.
The idea behind machine learning is that
 percepts received by an agent should be used not
only for acting, but also for improving the agent’s
ability to behave optimally in the future to achieve
its goal.
Machine Learning types
 Supervised learning:
a situation in which sample (input, output) pairs of the
function to be learned can be perceived or are given
 You can think of it as having a kind teacher
 Reinforcement learning:
the agent acts on its environment and receives some
evaluation of its action (a reinforcement), but is not
told which action is the correct one to achieve its goal
Reinforcement learning
 Task
Learn how to behave successfully to achieve a
goal while interacting with an external
environment
 Learn via experience!
 Examples
 Game playing: the player knows whether it wins or
loses, but not how to move at each step
 Control: a traffic system can measure the delay of
cars, but does not know how to decrease it
RL is learning from interaction
RL model
 Each percept e is enough to determine the
state (the state is accessible)
 The agent can decompose the reward component
from a percept
 The agent’s task: find an optimal policy, mapping
states to actions, that maximizes a long-run measure
of the reinforcement
 Think of reinforcement as reward
 Can be modeled as an MDP!
Review of MDP model
 MDP model <S,T,A,R>
[Figure: the agent-environment interaction loop. The agent receives a state and
reward from the environment and emits an action, generating the trajectory
s0, a0, r0, s1, a1, r1, s2, a2, r2, s3, …]
• S – set of states
• A – set of actions
• T(s,a,s’) = P(s’|s,a) – the probability of transition from s to s’ given action a
• R(s,a) – the expected reward for taking action a in state s
The expected reward follows from the transition model:
R(s,a) = Σ_{s’} P(s’|s,a) r(s,a,s’) = Σ_{s’} T(s,a,s’) r(s,a,s’)
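To make the tabular representation concrete, here is a minimal Python sketch of an MDP stored as dictionaries; the grid states, action name, and numbers are illustrative assumptions, not taken from the slides.

# A minimal tabular MDP sketch. The states, action name, and
# numbers below are illustrative assumptions.

# T[(s, a)] maps each successor state s' to P(s' | s, a)
T = {
    ((1, 1), "up"): {(1, 2): 0.8, (2, 1): 0.1, (1, 1): 0.1},
}

# r[(s, a, s_next)] is the immediate reward for that transition
r = {
    ((1, 1), "up", (1, 2)): -0.04,
    ((1, 1), "up", (2, 1)): -0.04,
    ((1, 1), "up", (1, 1)): -0.04,
}

def expected_reward(s, a):
    # R(s,a) = sum over s' of T(s,a,s') * r(s,a,s')
    return sum(p * r[(s, a, s2)] for s2, p in T[(s, a)].items())

print(expected_reward((1, 1), "up"))  # approximately -0.04

The expected_reward function is exactly the identity R(s,a) = Σ_{s’} T(s,a,s’) r(s,a,s’) above.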
Model-based vs. model-free
approaches
 But we don’t know anything about the environment
model—the transition function T(s,a,s’)
 There are two approaches
 Model-based RL:
learn the model, and use it to derive the optimal policy.
e.g., the adaptive dynamic programming (ADP) approach
 Model-free RL:
derive the optimal policy without learning the model.
e.g., the LMS and temporal difference (TD) approaches
 Which one is better?
Passive learning vs. active
learning
 Passive learning
 The agent simply watches the world going by and
tries to learn the utilities of being in various states
 Active learning
 The agent not only watches, but also acts
Example environment
[Figure: a 4×3 grid-world environment; the terminal states (4,3) and (4,2)
carry rewards +1 and −1 respectively.]
Passive learning scenario
 The agent sees sequences of state
transitions and associated rewards
 The environment generates the state transitions; the
agent perceives them
e.g., (1,1) (1,2) (1,3) (2,3) (3,3) (4,3)[+1]
(1,1) (1,2) (1,3) (1,2) (1,3) (1,2) (1,1) (2,1)
(3,1) (4,1) (4,2)[-1]
 Key idea: update the utility values using the
given training sequences
Passive learning scenario
LMS updating
 Reward-to-go of a state:
the sum of the rewards from that state until a
terminal state is reached
 Key: use the observed reward-to-go of a state as
direct evidence of the actual expected utility
of that state
 Learn the utility function directly from example
sequences
LMS updating
function LMS-UPDATE (U, e, percepts, M, N) returns an updated U
if TERMINAL?[e] then
{ reward-to-go ← 0
for each ei in percepts (starting from the end) do
s ← STATE[ei]
reward-to-go ← reward-to-go + REWARD[ei]
U[s] ← RUNNING-AVERAGE (U[s], reward-to-go, N[s])
end
}
function RUNNING-AVERAGE (U[s], reward-to-go, N[s])
U[s] ← [ U[s] × (N[s] – 1) + reward-to-go ] / N[s]
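A possible Python rendering of this pseudocode, as a sketch only: it assumes an episode arrives as a list of (state, reward) percepts, a data layout the slides do not fix.

def lms_update(U, N, percepts):
    """Direct utility estimation (LMS) for one complete episode.

    Walk the episode backwards, accumulate the observed reward-to-go,
    and fold it into a running average per state.
    U: dict mapping state -> utility estimate
    N: dict mapping state -> number of samples seen for that state
    percepts: list of (state, reward) pairs ending at a terminal state
    """
    reward_to_go = 0.0
    for state, reward in reversed(percepts):
        reward_to_go += reward
        N[state] = N.get(state, 0) + 1
        # U[s] <- [ U[s]*(N[s]-1) + reward-to-go ] / N[s]
        U[state] = (U.get(state, 0.0) * (N[state] - 1) + reward_to_go) / N[state]

Calling lms_update once per completed training sequence reproduces the running-average behavior of RUNNING-AVERAGE above.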
LMS updating algorithm in
passive learning
 Drawbacks:
 It ignores the constraint that the actual utility of a state is the
probability-weighted average of its successors’ utilities
 Converges very slowly to the correct utility values (requires a lot of sequences)
 for our example, more than 1000!
Temporal difference method in
passive learning
 TD(0) key idea:
 adjust the estimated utility value of the current state based on its
immediate reward and the estimated value of the next state
 The updating rule:
U(s) ← U(s) + α ( R(s) + U(s’) − U(s) )
 α is the learning rate parameter
 U(s) can converge to the correct value only if α is a function that
decreases as the number of times a state has been visited increases
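A minimal sketch of the rule in Python. The decaying schedule α = 1/N(s) is one common choice that satisfies the convergence condition above; it is an assumption, not the only valid schedule.

def td0_update(U, N, s, reward, s_next):
    """One TD(0) step: U(s) <- U(s) + alpha * (R(s) + U(s') - U(s))."""
    N[s] = N.get(s, 0) + 1
    alpha = 1.0 / N[s]          # learning rate decays with the visit count
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    U[s] += alpha * (reward + U[s_next] - U[s])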
The TD learning curve
[Figure: TD utility estimates for the states (4,3), (2,3), (2,2), (1,1), (3,1),
(4,1), (4,2), plotted against the number of training sequences.]
Adaptive dynamic programming (ADP)
in passive learning
 Different from the LMS and TD methods (model-free
approaches)
 ADP is a model-based approach!
 The updating rule for passive learning:
U(s) ← Σ_{s’} T(s,s’) ( r(s,s’) + U(s’) )
 However, in an unknown environment T is not given;
the agent must learn T itself from its experience of the
environment
 How to learn T?
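As a sketch, one sweep of this rule in Python, assuming the learned model under the fixed policy is stored as T[s] = {s’: probability} and r[(s, s’)] holds the one-step rewards; repeating the sweep until the values stop changing solves the resulting fixed-point equations.

def adp_passive_sweep(U, T, r, states):
    """One sweep of U(s) <- sum_{s'} T(s,s') * (r(s,s') + U(s'))."""
    for s in states:
        if s in T:  # terminal states have no outgoing transitions
            U[s] = sum(p * (r[(s, s2)] + U.get(s2, 0.0))
                       for s2, p in T[s].items())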
ADP learning curves
[Figure: ADP utility estimates for the states (4,3), (3,3), (2,3), (1,1), (3,1),
(4,1), (4,2), plotted against the number of training sequences.]
Active learning
 An active agent must consider
 what actions to take
 what their outcomes may be (both for learning and for receiving
rewards in the long run)
 Update utility equation:
U(s) = max_a [ R(s,a) + Σ_{s’} T(s,a,s’) U(s’) ]
 Rule to choose an action:
a = argmax_a [ R(s,a) + Σ_{s’} T(s,a,s’) U(s’) ]
Active ADP algorithm
For each s, initialize U(s), T(s,a,s’) and R(s,a)
Initialize s to the current state that is perceived
Loop forever
{
Select an action a (using the current model R and T) and execute it:
a = argmax_a [ R(s,a) + Σ_{s’} T(s,a,s’) U(s’) ]
Receive immediate reward r and observe the new state s’
Use the transition tuple <s,a,s’,r> to update the model R and T (see further)
For all states s, update U(s) using the updating rule:
U(s) = max_a [ R(s,a) + Σ_{s’} T(s,a,s’) U(s’) ]
s = s’
}
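A hedged Python sketch of one iteration of this loop. env.execute is a hypothetical environment interface, update_model is the model-learning step sketched on the next slide, and the exact solve of U over all states is approximated by a fixed number of sweeps.

def best_action(s, U, T, R, actions):
    """a = argmax_a [ R(s,a) + sum_{s'} T(s,a,s') U(s') ]"""
    def value(a):
        return R.get((s, a), 0.0) + sum(
            p * U.get(s2, 0.0) for s2, p in T.get((s, a), {}).items())
    return max(actions, key=value)

def active_adp_step(s, U, T, R, actions, env, n_sweeps=20):
    """One pass of the active ADP loop."""
    a = best_action(s, U, T, R, actions)
    s_next, reward = env.execute(a)           # hypothetical environment call
    update_model(T, R, s, a, s_next, reward)  # model update (next slide)
    U.setdefault(s_next, 0.0)
    for _ in range(n_sweeps):                 # approximate 'update U for all states'
        for state in list(U):
            U[state] = max(
                R.get((state, a2), 0.0)
                + sum(p * U.get(s2, 0.0)
                      for s2, p in T.get((state, a2), {}).items())
                for a2 in actions)
    return s_next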
How to learn the model?
 Use the transition tuple <s, a, s’, r> to learn T(s,a,s’) and
R(s,a). That’s supervised learning!
 Since the agent observes every transition (s, a, s’, r) directly, it can take
(s,a)/s’ as an input/output example of the transition probability
function T
 Different supervised learning techniques apply (see further reading
for detail)
 Use r and T(s,a,s’) to learn R(s,a):
R(s,a) = Σ_{s’} T(s,a,s’) r
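The simplest such supervised estimate is a maximum-likelihood frequency count, sketched below; the dictionary layout matches the ADP sketch above and is an assumption.

from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = observed visits
reward_sum = defaultdict(float)                 # running sum of r observed for (s, a)

def update_model(T, R, s, a, s_next, reward):
    """Maximum-likelihood estimates of T(s,a,s') and R(s,a) from one
    observed transition tuple <s, a, s', r>."""
    counts[(s, a)][s_next] += 1
    total = sum(counts[(s, a)].values())
    T[(s, a)] = {s2: n / total for s2, n in counts[(s, a)].items()}
    reward_sum[(s, a)] += reward
    # averaging r over visits realizes R(s,a) = sum_{s'} T(s,a,s') r
    R[(s, a)] = reward_sum[(s, a)] / total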
ADP approach pros and cons
 Pros:
 The ADP algorithm converges far faster than LMS and temporal difference
learning, because it uses the information from the model of the environment
 Cons:
 Intractable for large state spaces
 In each step, U is updated for all states
 This can be improved by prioritized sweeping (see further reading for detail)
Another model-free method:
TD Q-learning
 Define the Q-value function:
U(s) = max_a Q(s,a)
 Q-value function updating rule: substituting the Q-value function into
U(s) = max_a [ R(s,a) + Σ_{s’} T(s,a,s’) U(s’) ]
and using Q(s,a) = R(s,a) + Σ_{s’} T(s,a,s’) U(s’) gives
Q(s,a) = R(s,a) + Σ_{s’} T(s,a,s’) max_{a’} Q(s’,a’)   <*>
 Key idea of TD Q-learning:
 combine <*> with the temporal difference approach
 The updating rule:
Q(s,a) ← Q(s,a) + α ( r + max_{a’} Q(s’,a’) − Q(s,a) )
 Rule to choose the action to take:
a = argmax_a Q(s,a)
TD Q-learning agent algorithm
For each pair (s, a), initialize Q(s,a)
Observe the current state s
Loop forever
{
Select an action a and execute it:
a = argmax_a Q(s,a)
Receive immediate reward r and observe the new state s’
Update Q(s,a):
Q(s,a) ← Q(s,a) + α ( r + max_{a’} Q(s’,a’) − Q(s,a) )
s = s’
}
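The whole agent fits in a few lines of Python; env.current_state and env.execute are hypothetical environment calls, and the purely greedy action choice mirrors the slide (in practice one mixes in exploration, as the next slides discuss).

from collections import defaultdict

def q_learning_agent(env, actions, alpha=0.1, n_steps=10_000):
    """Tabular TD Q-learning:
    Q(s,a) <- Q(s,a) + alpha * (r + max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)                 # Q[(s, a)], initialized to 0
    s = env.current_state()                # hypothetical environment call
    for _ in range(n_steps):
        a = max(actions, key=lambda a2: Q[(s, a2)])   # a = argmax_a Q(s,a)
        s_next, reward = env.execute(a)               # hypothetical call
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (reward + best_next - Q[(s, a)])
        s = s_next
    return Q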
Exploration problem in Active
learning
 An action has two kinds of outcome
 Gain rewards on the current experience
tuple (s,a,s’)
 Affect the percepts received, and hence
the ability of the agent to learn
Exploration problem in Active
learning
 A trade-off when choosing an action, between
 its immediate good (reflected in its current utility estimates, using
what has been learned so far)
 its long-term good (exploring more about the environment helps the
agent behave optimally in the long run)
 Two extreme approaches
 “wacky” approach: act randomly, in the hope that the agent will eventually
explore the entire environment
 “greedy” approach: act to maximize utility using the current model
estimate
See Figure 20.10
 Just like humans in the real world! People need to decide between
 continuing in a comfortable existence
 or striking out into the unknown in the hope of discovering a new
and better life
Exploration problem in Active
learning
 One kind of solution: the agent should be more wacky when it has
little idea of the environment, and more greedy when it has a
model that is close to being correct
 In a given state, the agent should give some weight to actions that it
has not tried very often
 while tending to avoid actions that are believed to be of low utility
 Implemented by an exploration function f(u,n):
 assigns a higher utility estimate to relatively unexplored action-state
pairs
 Change the updating rule of the value function to
U+(s) ← max_a f( r(s,a) + Σ_{s’} T(s,a,s’) U+(s’), N(a,s) )
 U+ denotes the optimistic estimate of the utility
Exploration problem in Active
learning
 One kind of definition of f(u,n):
f(u,n) = R+   if n < Ne
         u    otherwise
 R+ is an optimistic estimate of the best possible reward
obtainable in any state
 The agent will try each action-state pair (s,a) at least Ne times
 The agent will behave initially as if there were wonderful rewards
scattered all over the place – optimistic
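As a sketch, f and its two constants (hypothetical stand-ins for R+ and Ne) look like this in Python:

R_PLUS = 1.0  # hypothetical stand-in for R+, the optimistic best possible reward
N_E = 5       # hypothetical stand-in for Ne, the minimum trial count

def f(u, n):
    """Exploration function: optimistic for rarely tried (action, state) pairs."""
    return R_PLUS if n < N_E else u

Used inside the max over actions in the update above, f returns the optimistic value R_PLUS until (a,s) has been tried Ne times, after which the learned estimate u takes over.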
Generalization in
Reinforcement Learning
 So far we have assumed that all the functions learned by
the agent (U, T, R, Q) are in tabular form,
i.e., it is possible to enumerate the state and action
spaces
 Use generalization techniques to deal with large state
or action spaces
 Function approximation techniques
Genetic algorithm and Evolutionary
programming
 Start with a set of individuals
 Apply selection and reproduction operators to “evolve” an individual that is
successful (as measured by a fitness function)
Genetic algorithm and Evolutionary
programming
 Imagine the individuals as agent functions
 Fitness function as performance measure or reward
function
 No attempt is made to learn the relationship between the
rewards and the actions taken by an agent
 It simply searches directly in the individual space to
find one that maximizes the fitness function
Genetic algorithm and Evolutionary
programming
 Represent an individual as a binary string (each bit of the string is called a gene)
 Selection works like this: if individual X scores twice as high as Y on the fitness
function, then X is twice as likely to be selected for reproduction as Y
 Reproduction is accomplished by cross-over and mutation
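A toy Python sketch of these three ingredients: binary-string individuals, fitness-proportional selection, and reproduction by single-point crossover plus mutation. The fitness function (counting 1-bits) is an illustrative stand-in.

import random

def evolve(pop_size=20, n_genes=10, n_generations=50, p_mut=0.01):
    """Toy GA: binary-string individuals, fitness-proportional selection,
    single-point crossover, and per-gene mutation."""
    fitness = lambda ind: sum(ind) + 1e-9   # stand-in fitness: count of 1-bits
    pop = [[random.randint(0, 1) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(n_generations):
        weights = [fitness(ind) for ind in pop]   # twice the score, twice the chance
        new_pop = []
        for _ in range(pop_size):
            x, y = random.choices(pop, weights=weights, k=2)
            cut = random.randrange(1, n_genes)    # single-point crossover
            child = x[:cut] + y[cut:]
            child = [1 - g if random.random() < p_mut else g
                     for g in child]              # mutation flips each gene w.p. p_mut
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)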
Thank you!