Reinforcement Learning:
Markov Chain and Monte Carlo
MSC-IT Part-1
By - Ajay Chaurasiya
What is Reinforcement Learning?
 “Teach by experience”
 For each step, an agent will:
Execute an action
Observe a new state
Receive a reward
 The agent takes actions in its environment to maximize the reward it receives (a minimal loop is sketched below)
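To make the loop concrete, here is a minimal sketch in Python (my illustration, not from the slides); env is a hypothetical environment with reset() and step() methods, and the agent simply picks random actions:

import random

def run_episode(env, actions, max_steps=100):
    # "Teach by experience": act, observe the new state, receive a reward.
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = random.choice(actions)          # execute an action
        state, reward, done = env.step(action)   # observe a new state, receive a reward
        total_reward += reward                   # accumulate the reward to be maximized
        if done:
            break
    return total_reward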
Main points in Reinforcement learning
 Input
 Output
 Training
 The model keeps learning continuously.
 The best solution is decided based on the maximum reward.
 Example: We have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward.
The accompanying image shows a robot, a diamond, and fire. The goal of the robot is to reach the reward (the diamond) while avoiding the hurdles (the fire).
The robot learns by trying all possible paths and then choosing the path that reaches the reward with the fewest hurdles. Each right step gives the robot a reward and each wrong step subtracts from its reward. The total reward is calculated when the robot reaches the final reward, the diamond.
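As a rough illustration of this reward scheme (the values below are assumed, not taken from the slides), each safe step could earn +1, stepping into fire -1, and reaching the diamond a large bonus; the robot prefers the path with the higher total:

def path_return(path, rewards):
    # Sum the reward of every cell the robot visits along a candidate path.
    return sum(rewards.get(cell, 0) for cell in path)

# Assumed reward layout: the diamond is the goal, fire cells are hurdles.
rewards = {"diamond": 10, "fire": -1, "empty": 1}
print(path_return(["empty", "empty", "fire", "diamond"], rewards))   # 11
print(path_return(["empty", "empty", "empty", "diamond"], rewards))  # 13, the better path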
Markov Chain Learning
 A Markov chain is a probabilistic model.
 It describes a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
 In continuous time it is also known as a Markov process.
 The Markov property states that the future depends only on the present and
not on the past.
 Moving from one state to another is called a transition, and its probability is
called a transition probability.
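A minimal sketch (mine, not from the slides) of simulating such a chain in Python, using an assumed two-state weather example and its transition probabilities:

import random

# Assumed transition probabilities: P[current_state][next_state]
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(state):
    # The next state depends only on the current state (Markov property).
    states, probs = zip(*P[state].items())
    return random.choices(states, weights=probs)[0]

state = "sunny"
for _ in range(10):
    state = next_state(state)
    print(state)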
 Example: A robot car wants to travel far, quickly
1. Three states: Cool, Warm, Overheated
2. Two actions: Slow, Fast
3. Going faster gets double the reward
Note: In a Markov Decision Process, the transition probabilities out of each
state (for a given action) always sum to one.
An MDP can be represented by 5 important elements:
 State (S)
 Actions (A)
 Transition Probability (Pᵃₛ₁ₛ₂)
 Reward Probability (Rᵃₛ₁ₛ₂)
 Discount factor (γ)
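Putting the five elements together for the robot-car example above, a sketch in Python might look like this; the exact probabilities and reward values are assumed for illustration, not taken from the slides:

# States S, actions A, transition probabilities P[s][a] -> {s2: prob},
# rewards R[s][a], and discount factor gamma.
S = ["cool", "warm", "overheated"]
A = ["slow", "fast"]

P = {
    "cool": {"slow": {"cool": 1.0},
             "fast": {"cool": 0.5, "warm": 0.5}},
    "warm": {"slow": {"cool": 0.5, "warm": 0.5},
             "fast": {"overheated": 1.0}},
    "overheated": {},                 # terminal state: no actions available
}

R = {
    "cool": {"slow": 1.0, "fast": 2.0},     # going faster gets double the reward
    "warm": {"slow": 1.0, "fast": -10.0},   # assumed penalty for overheating
}

gamma = 0.9  # discount factor

# Each row of transition probabilities sums to one, e.g.:
assert sum(P["cool"]["fast"].values()) == 1.0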
Continue..
Rewards
 Based on the action the agent performs, it receives a reward. A reward is nothing
but a numerical value, say +1 for a good action and -1 for a bad action.
 The total amount of reward the agent receives from the environment is called the
return. For an episodic task we can write the return as
R(t) = r(t+1) + r(t+2) + r(t+3) + r(t+4) + ... + r(T)
Discount factor
 Since a continuous task has no final state, its return
R(t) = r(t+1) + r(t+2) + r(t+3) + ... would sum up to ∞
 That is why we introduce the notion of a discount factor. We can redefine the return
with a discount factor as follows: R(t) = r(t+1) + γ r(t+2) + γ² r(t+3) + ...
 The value of the discount factor lies between 0 and 1
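As a quick check of the discounted-return formula, a small Python sketch (the reward values are assumed):

def discounted_return(rewards, gamma):
    # R(t) = r(t+1) + gamma*r(t+2) + gamma^2*r(t+3) + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1, 1, 1, 1]                    # r(t+1), r(t+2), ...
print(discounted_return(rewards, 1.0))    # undiscounted: 4.0
print(discounted_return(rewards, 0.5))    # 1 + 0.5 + 0.25 + 0.125 = 1.875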
Monte Carlo Learning
 The Monte Carlo method for reinforcement learning learns directly from
episodes of experience without any prior knowledge of MDP transitions. Here,
the random component is the return or reward.
 Below are the key characteristics of the Monte Carlo (MC) method:
1. There is no model (the agent does not know the MDP state transitions)
2. The agent learns from sampled experience
3. It learns the state value vπ(s) under policy π from the average return over
all sampled episodes (value = average return)
4. Values are updated only after a complete episode
5. There is no bootstrapping
6. It can only be used in episodic problems
Continue..
 In the Monte Carlo method, instead of the expected return we use the empirical
return that the agent has sampled by following the policy.
 Example: Gems collection
 The agent follows the policy and completes an episode; along the way, at each
step it collects rewards in the form of gems. To get the state value, the agent
sums up all the gems collected in each episode, starting from that
state.
Continue..
 Return(Sample 01) = 2 + 1 + 2 + 2 + 1 + 5 = 13 gems
 Return(Sample 02) = 2 + 3 + 1 + 3 + 1 + 5 = 15 gems
 Return(Sample 03) = 2 + 3 + 1 + 3 + 1 + 5 = 15 gems
 Observed mean return (based on 3 samples) = (13 + 15 + 15)/3 = 14.33 gems
 Thus, per the Monte Carlo method, the state value vπ(S05) is 14.33 gems, based on
3 samples following policy π.
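The same averaging in code, using the sample returns from the slide:

sample_returns = [13, 15, 15]             # gems collected in the three sampled episodes
v_estimate = sum(sample_returns) / len(sample_returns)
print(round(v_estimate, 2))               # 14.33, the Monte Carlo estimate of v_pi(S05)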
 First-visit Monte Carlo: even if the agent comes back to the same state multiple
times in an episode, only the first visit is counted. The detailed steps are as
follows (a code sketch appears after the list):
 To evaluate state s, first set the number of visits N(s) = 0 and the total return TR(s) =
0 (these values are updated across episodes)
 The first time-step t that state s is visited in an episode, increment the counter
N(s) = N(s) + 1
 Increment the total return TR(s) = TR(s) + Gt
 The value is estimated by the mean return V(s) = TR(s)/N(s)
 By the law of large numbers, V(s) → vπ(s) (the true value under policy π)
as N(s) approaches infinity
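A sketch of first-visit Monte Carlo prediction following these steps (my illustration, assuming episodes are given as lists of (state, reward) pairs, where the reward is the one received after leaving that state; the state names in the usage example are hypothetical):

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    # Estimate V(s) as the mean first-visit return over sampled episodes.
    N = defaultdict(int)      # N(s): number of first visits to state s
    TR = defaultdict(float)   # TR(s): total return accumulated for state s
    for episode in episodes:                  # episode = [(state, reward), ...]
        # Compute the return G_t from each time step to the end of the episode.
        G, returns = 0.0, []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            returns.insert(0, G)
        seen = set()
        for (state, _), G_t in zip(episode, returns):
            if state in seen:
                continue                      # only the first visit is counted
            seen.add(state)
            N[state] += 1                     # N(s) = N(s) + 1
            TR[state] += G_t                  # TR(s) = TR(s) + G_t
    return {s: TR[s] / N[s] for s in N}       # V(s) = TR(s) / N(s)

# Usage with the first gems sample (state names assumed): V(S05) comes out as 13.0.
episodes = [[("S05", 2), ("S06", 1), ("S07", 2), ("S08", 2), ("S09", 1), ("S10", 5)]]
print(first_visit_mc(episodes))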
Thank You !