UNIT – I
REINFORCEMENT
LEARNING
9/12/2023 1
Basics of Probability and Linear Algebra:
Probability:
• Experiment: An action or procedure that produces outcomes.
• Sample Space (S): The set of all possible outcomes.
• Event: A subset of the sample space.
• Probability: A measure of the likelihood of an event, represented as a number between 0 and 1, inclusive.
Basic Concepts:
• Complementary Events: If A is an event, its complement A′ is the event that A does not occur, so P(A′) = 1 − P(A).
• Conditional Probability: P(A∣B) is the probability of event A given that event B has occurred; for P(B) > 0, P(A∣B) = P(A∩B)/P(B).
• Independence: Two events A and B are independent if P(A∩B) = P(A)P(B).
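The definitions above can be checked on a small example. The sketch below assumes a single roll of a fair six-sided die (an example chosen here, not taken from the slides); the helper names `prob` and `cond_prob` are hypothetical.

```python
from fractions import Fraction

# Assumed example: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of the sample space), all outcomes equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

def cond_prob(a, b):
    """P(A | B) = P(A ∩ B) / P(B), defined only when P(B) > 0."""
    return prob(a & b) / prob(b)

even = {2, 4, 6}
at_most_two = {1, 2}
complement_even = sample_space - even  # the event "even does not occur"

print(prob(even))                    # 1/2
print(prob(complement_even))         # 1/2, i.e. 1 - P(even)
print(cond_prob(even, at_most_two))  # 1/2

# Independence check: P(A ∩ B) == P(A) * P(B)
independent = prob(even & at_most_two) == prob(even) * prob(at_most_two)
print(independent)                   # True
```

Exact fractions are used so that the independence test is not affected by floating-point rounding.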
Linear Algebra:
• Vector: An ordered list of numbers.
• Matrix: A rectangular array of numbers.
• Dot Product: Given two vectors u and v of the same length n, their dot product is u⋅v = u_1 v_1 + u_2 v_2 + ... + u_n v_n.
• Matrix Multiplication: The product of two matrices A and B is another matrix whose (i, j) entry is the dot product of row i of A with column j of B. It is defined when A has as many columns as B has rows.
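As a minimal sketch of the two operations just defined, the following uses plain Python lists (the function names `dot` and `matmul` are chosen here for illustration; a numerical library would normally be used in practice).

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

def matmul(A, B):
    """Matrix product: entry (i, j) is the dot product of row i of A and column j of B."""
    if len(A[0]) != len(B):
        raise ValueError("columns of A must match rows of B")
    columns_of_B = list(zip(*B))  # transpose B to iterate over its columns
    return [[dot(row, col) for col in columns_of_B] for row in A]

u = [1, 2, 3]
v = [4, 5, 6]
print(dot(u, v))     # 32

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```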
Definition of a Stochastic Multi-Armed Bandit
• A multi-armed bandit problem is a classical optimization problem where an agent interacts with multiple options
(the "arms") and at each interaction, the agent must choose which arm to "pull" to receive a reward. The reward
is drawn from a probability distribution associated with the chosen arm, but the distributions are initially
unknown to the agent.
• Stochastic: The term "stochastic" implies that the rewards are random variables with some underlying (but
unknown) probability distribution.
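To make the definition concrete, here is a minimal simulation sketch. It assumes Bernoulli rewards (each pull pays 1 with a fixed probability, 0 otherwise); the class name `BernoulliBandit` and the arm means are hypothetical choices, not part of the slides.

```python
import random

class BernoulliBandit:
    """Stochastic bandit whose arm a pays 1 with probability means[a], else 0."""

    def __init__(self, means, seed=0):
        self.means = list(means)         # hidden from the agent in a real problem
        self.rng = random.Random(seed)   # seeded for reproducibility

    @property
    def n_arms(self):
        return len(self.means)

    def pull(self, arm):
        """Draw one random reward from the chosen arm's distribution."""
        return 1 if self.rng.random() < self.means[arm] else 0

bandit = BernoulliBandit([0.2, 0.5, 0.8], seed=42)
rewards = [bandit.pull(2) for _ in range(1000)]
# The empirical mean should be close to the arm's true mean of 0.8.
print(sum(rewards) / len(rewards))
```

The agent only ever sees the values returned by `pull`; the `means` list plays the role of the unknown distributions.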
Definition of Regret
• Regret is a measure of the difference between the reward an agent receives by following its policy and the reward
it would have received by always selecting the best arm. Mathematically:
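The formula itself did not survive on this slide. A standard way to write the (expected) regret after T rounds is given below, assuming K arms with mean rewards μ_1, ..., μ_K, best mean μ*, and a_t denoting the arm chosen in round t. This notation is an assumption and may differ from the original slide.

```latex
R_T = T\,\mu^* - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right],
\qquad \mu^* = \max_{1 \le a \le K} \mu_a
```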
In the slot machine scenario, regret is the difference between:
• the total reward you would get by always playing the best machine (which requires knowing in advance which machine is best, and you do not), and
• the actual reward you get.
It measures how good your decisions are compared with the best possible decisions.
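The comparison described above can be sketched in a few lines. The example assumes the true mean reward of each machine is known to the evaluator (never to the player) and measures regret in expected rewards; the function name `pseudo_regret` and the numbers are hypothetical.

```python
def pseudo_regret(means, pulls):
    """Expected reward lost by playing `pulls` instead of always playing the best arm."""
    best = max(means)
    return sum(best - means[arm] for arm in pulls)

means = [0.2, 0.5, 0.8]   # assumed true mean reward of each machine
pulls = [0, 1, 2, 2, 2]   # machines played in rounds 1..5

# Rounds 1 and 2 lose about 0.6 and 0.3 in expectation; later rounds lose nothing.
print(pseudo_regret(means, pulls))
```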
Achieving Sublinear Regret:
• Sublinear regret implies that the average regret per round goes to zero as the number of rounds T increases. For a
bandit algorithm, achieving sublinear regret means that, in the long run, the performance of the algorithm
approaches the performance of the best arm.
• As you play more, your average regret per play should decrease if you're making good decisions. If your total regret grows more slowly than the number of times you play (sublinearly), then over time your average reward per play gets close to that of the best machine.
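A small numeric illustration of the difference between linear and sublinear growth follows. The two regret curves are hypothetical shapes chosen for illustration (a constant loss per round versus growth proportional to the square root of T); they are not measurements of any particular algorithm.

```python
import math

def average_regret(cumulative_regret, horizon):
    """Average regret per round after `horizon` rounds."""
    return cumulative_regret(horizon) / horizon

def linear(t):
    return 0.3 * t             # hypothetical: loses 0.3 per round forever

def sublinear(t):
    return 2.0 * math.sqrt(t)  # hypothetical: regret grows like sqrt(T)

for horizon in (100, 10_000, 1_000_000):
    print(horizon,
          average_regret(linear, horizon),
          average_regret(sublinear, horizon))
```

The linear curve keeps an average regret of about 0.3 at every horizon, while the sublinear curve's average shrinks toward zero as the horizon grows.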
UCB Algorithm (Upper Confidence Bound):
• The UCB (Upper Confidence Bound) algorithm balances exploration (trying arms whose rewards are still uncertain) with exploitation (playing the arms that look best so far). At each step, it chooses the arm with the highest upper confidence bound:
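The formula is missing from this slide. One common choice is the UCB1 index, shown here under the assumption that the empirical mean of arm a before round t is written as a hatted μ and the number of times it has been pulled as N_a(t). The notation and the exploration constant are assumptions and may differ from the original slide.

```latex
a_t = \arg\max_{a} \left[ \hat{\mu}_a(t) + \sqrt{\frac{2 \ln t}{N_a(t)}} \right]
```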
• This is a strategy to deal with the slot machines. Rather than just considering the average reward of each
machine, it also considers how uncertain we are about each machine's reward. It then plays the machine that has
the highest potential to be the best.
• The formula gives an "optimistic" estimate of each machine's potential.
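The strategy described above can be sketched as follows. This is a minimal version of UCB1 (one common variant, assumed here since the slide's exact formula is not available), run against simulated Bernoulli arms whose means are hypothetical.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on simulated Bernoulli arms and return how often each arm was played."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k    # number of pulls per arm
    sums = [0.0] * k    # total reward per arm

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play every arm once so each index is defined
        else:
            # Optimistic estimate: empirical mean plus an uncertainty bonus
            # that shrinks as the arm is played more often.
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
print(counts)  # the arm with mean 0.8 should receive most of the pulls
```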
KL-UCB:
• KL-UCB is an extension of UCB, incorporating the Kullback-Leibler (KL) divergence to determine the
confidence intervals. Instead of the confidence interval from UCB, KL-UCB uses the equation:
• It is a refined version of UCB. Instead of a simple measure of uncertainty, it uses the Kullback-Leibler divergence, a concept from information theory that measures how far one probability distribution is from another. This can sometimes give tighter estimates of each machine's potential.
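The KL-UCB equation is missing from the slide. A commonly used form for Bernoulli rewards defines the index of an arm as the largest value q in [empirical mean, 1] such that N · KL(empirical mean, q) ≤ ln t, where N is the number of pulls of that arm. The sketch below computes that index by bisection; the simplified exploration term ln t and the function names are assumptions made for illustration.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, tol=1e-6):
    """Largest q >= mean with pulls * KL(mean, q) <= ln(t), found by bisection."""
    budget = math.log(t) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= budget:
            lo = mid   # mid is still plausible, so move the lower end up
        else:
            hi = mid   # mid is too optimistic, so move the upper end down
    return lo

# Same empirical mean, different amounts of data: more pulls give a tighter index.
index_few = kl_ucb_index(0.5, pulls=10, t=100)
index_many = kl_ucb_index(0.5, pulls=1000, t=100)
print(index_few, index_many)
```

Bisection works here because KL(mean, q) increases as q moves from the empirical mean toward 1.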
Thompson Sampling:
• Thompson Sampling is a Bayesian approach to the bandit problem. For each arm:
• Maintain a posterior distribution over the expected reward of that arm.
• At each step, sample from the posterior distribution of each arm.
• Choose the arm with the highest sample.
• The key to Thompson Sampling is the updating of the posterior distribution based on observed rewards. It
naturally balances exploration and exploitation through the stochastic nature of the sampling process.
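The steps above can be sketched for Bernoulli rewards, assuming a uniform Beta(1, 1) prior on each arm's success probability so that the posterior stays a Beta distribution. The prior, the arm means, and the function name are assumptions made for illustration.

```python
import random

def thompson_sampling(means, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling; returns how often each arm was played."""
    rng = random.Random(seed)
    k = len(means)
    successes = [0] * k   # observed rewards of 1 per arm
    failures = [0] * k    # observed rewards of 0 per arm

    for _ in range(horizon):
        # Sample one plausible mean per arm from its Beta posterior.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(k)
        ]
        # Play the arm whose sample is highest.
        arm = max(range(k), key=lambda a: samples[a])
        # Observe a simulated reward and update that arm's posterior counts.
        if rng.random() < means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]

counts = thompson_sampling([0.2, 0.5, 0.8], horizon=2000)
print(counts)  # the arm with mean 0.8 should receive most of the pulls
```

Arms with little data have wide posteriors and so occasionally produce the highest sample, which is where the exploration comes from.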
THANK YOU