UNIT – I
REINFORCEMENT LEARNING
9/12/2023 1
Basics of Probability and Linear Algebra:
Probability:
 Experiment: An action or procedure that produces outcomes.
 Sample Space (S): The set of all possible outcomes.
 Event: A subset of the sample space.
 Probability: A measure of the likelihood of an event, represented as a number between 0 and 1, inclusive.
Basic Concepts:
 Complementary Events: If A is an event, then its complement A′ is the event that A does not occur.
 Conditional Probability: P(A∣B) is the probability of event A given that event B has occurred.
 Independence: Two events A and B are independent if P(A∩B)=P(A)P(B).
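These definitions can be checked empirically. The sketch below (plain Python; the two dice events are illustrative choices of ours, not from the slides) estimates P(A), P(B), and P(A∩B) and confirms the independence criterion P(A∩B) = P(A)P(B):

```python
import random

random.seed(0)
N = 100_000
# Roll two fair dice; event A: first die is even, event B: the sum is 7.
count_A = count_B = count_AB = 0
for _ in range(N):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    A = d1 % 2 == 0
    B = d1 + d2 == 7
    count_A += A
    count_B += B
    count_AB += A and B

p_A, p_B, p_AB = count_A / N, count_B / N, count_AB / N
# For these events P(A∩B) = 1/12 = P(A)P(B) = 1/2 * 1/6, so A and B
# are independent, and the two estimates below should nearly agree.
print(p_AB, p_A * p_B)
```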
Basics of Probability and Linear Algebra:
Linear Algebra:
 Vector: An ordered list of numbers.
 Matrix: A rectangular array of numbers.
 Dot Product: Given two vectors u and v, their dot product is u⋅v = u1v1 + u2v2 + ... + unvn.
 Matrix Multiplication: The product of two matrices A and B is another matrix whose elements are formed by
taking the dot product of the rows of A with the columns of B.
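A minimal sketch of both operations in plain Python (the example vectors and matrices are made up for illustration):

```python
def dot(u, v):
    # u.v = u1*v1 + u2*v2 + ... + un*vn
    return sum(ui * vi for ui, vi in zip(u, v))

def matmul(A, B):
    # Element (i, j) of the product is the dot product of
    # row i of A with column j of B.
    cols_B = list(zip(*B))  # transpose B to iterate over its columns
    return [[dot(row, col) for col in cols_B] for row in A]

print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))               # [[19, 22], [43, 50]]
```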
Definition of a Stochastic Multi-Armed Bandit
• A multi-armed bandit problem is a classical optimization problem where an agent interacts with multiple options
(the "arms") and at each interaction, the agent must choose which arm to "pull" to receive a reward. The reward
is drawn from a probability distribution associated with the chosen arm, but the distributions are initially
unknown to the agent.
• Stochastic: The term "stochastic" implies that the rewards are random variables with some underlying (but
unknown) probability distribution.
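A stochastic bandit is easy to simulate. The sketch below assumes Bernoulli reward distributions with made-up means; in a real run the agent would only observe the 0/1 rewards, never the means themselves:

```python
import random

random.seed(1)

class BernoulliBandit:
    """Stochastic multi-armed bandit with Bernoulli reward distributions.
    The true means would be unknown to the agent; it only sees rewards."""
    def __init__(self, means):
        self._means = means

    def pull(self, arm):
        # Reward drawn from the chosen arm's distribution:
        # 1 with probability means[arm], else 0.
        return 1 if random.random() < self._means[arm] else 0

bandit = BernoulliBandit([0.2, 0.5, 0.8])   # made-up arm means
rewards = [bandit.pull(2) for _ in range(1000)]
print(sum(rewards) / len(rewards))   # close to arm 2's true mean, 0.8
```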
Definition of Regret
• Regret is a measure of the difference between the reward an agent receives by following its policy and the reward it would have received by always selecting the best arm. Mathematically, after T rounds:
R(T) = T·μ* − E[ μ_a(1) + μ_a(2) + ... + μ_a(T) ]
where μ* is the mean reward of the best arm and μ_a(t) is the mean reward of the arm chosen at round t.
Definition of Regret
In our slot machine scenario, regret is the difference between:
• The total reward you'd get if you always played the best machine (knowing future outcomes, which you can't)
and the actual reward you get.
• It's a measure of how well you're making decisions compared to the "best possible" decisions.
Achieving Sublinear Regret:
• Sublinear regret implies that the average regret per round goes to zero as the number of rounds T increases. For a
bandit algorithm, achieving sublinear regret means that, in the long run, the performance of the algorithm
approaches the performance of the best arm.
• As you play more, your average regret per play should decrease if you're making good decisions. If your regret
grows slower than the number of times you play (sublinearly), then, over time, you're getting close to the best
possible outcome.
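The contrast can be illustrated with a small simulation (the arm means are hypothetical, and `avg_regret_etc` is a simple explore-then-commit strategy of our choosing, not from the slides): a random policy has linear regret, so its average regret per round stays flat, while a policy that learns drives the average toward zero.

```python
import random

random.seed(2)
means = [0.3, 0.5, 0.8]     # hypothetical true arm means, hidden from the agent
best = max(means)

def pull(arm):
    # Bernoulli reward: 1 with probability means[arm], else 0.
    return 1 if random.random() < means[arm] else 0

def avg_regret_random(T):
    # Uniformly random play: expected regret grows linearly in T,
    # so the average regret per round does NOT shrink.
    total = sum(best - means[random.randrange(len(means))] for _ in range(T))
    return total / T

def avg_regret_etc(T, m=200):
    # Explore-then-commit: try each arm m times, then commit to the
    # empirical best. Regret is paid almost entirely during the fixed
    # exploration phase, so the average regret per round shrinks with T.
    wins = [sum(pull(arm) for _ in range(m)) for arm in range(len(means))]
    commit = wins.index(max(wins))
    total = sum(m * (best - mu) for mu in means)             # exploration regret
    total += (T - m * len(means)) * (best - means[commit])   # commit regret
    return total / T

print(avg_regret_random(10_000))  # stays near (0.5+0.3+0)/3 ≈ 0.27 for any T
print(avg_regret_etc(10_000))     # shrinks toward 0 as T grows
```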
UCB Algorithm (Upper Confidence Bound):
• The UCB (Upper Confidence Bound) algorithm balances exploration (trying out all arms) with exploitation (playing the best-known arm). At each step t, it chooses the arm with the highest upper confidence bound:
UCB_i(t) = x̄_i + sqrt(2 ln t / n_i)
where x̄_i is the empirical mean reward of arm i and n_i is the number of times arm i has been played so far.
UCB Algorithm (Upper Confidence Bound):
• This is a strategy to deal with the slot machines. Rather than just considering the average reward of each
machine, it also considers how uncertain we are about each machine's reward. It then plays the machine that has
the highest potential to be the best.
• The formula gives an "optimistic" estimate of each machine's potential.
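A minimal sketch of this strategy in its standard UCB1 form, using the index x̄_i + sqrt(2 ln t / n_i) and made-up arm means:

```python
import math
import random

random.seed(3)
means = [0.3, 0.5, 0.8]   # hypothetical true arm means, hidden from the agent

def pull(arm):
    return 1 if random.random() < means[arm] else 0

def ucb1(T):
    k = len(means)
    n = [0] * k       # times each arm was played
    s = [0.0] * k     # total reward from each arm
    for arm in range(k):              # play every arm once to initialize
        s[arm] += pull(arm)
        n[arm] += 1
    for t in range(k, T):
        # Index = empirical mean + exploration bonus sqrt(2 ln t / n_i):
        # rarely played arms get a large bonus, so uncertain arms are tried.
        idx = [s[i] / n[i] + math.sqrt(2 * math.log(t) / n[i]) for i in range(k)]
        arm = idx.index(max(idx))
        s[arm] += pull(arm)
        n[arm] += 1
    return n

counts = ucb1(5000)
print(counts)   # the best arm (index 2) receives most of the pulls
```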
KL-UCB:
• KL-UCB is an extension of UCB that uses the Kullback-Leibler (KL) divergence to construct the confidence intervals. In place of UCB's index, KL-UCB selects the arm maximizing:
max{ q : n_i · d(x̄_i, q) ≤ ln t + c ln ln t }
where d(p, q) is the KL divergence between Bernoulli distributions with means p and q, and c is a small constant.
• It is a refined version of UCB. Instead of a simple measure of uncertainty, it uses the Kullback-Leibler
divergence, a concept from information theory. This can sometimes give better estimates of each
machine's potential.
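For Bernoulli rewards the KL-UCB index has no closed form, but since d(p, q) is increasing in q for q ≥ p, it can be found by bisection. A sketch (function names are ours):

```python
import math

def kl_bernoulli(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q),
    # clamped away from 0 and 1 to avoid log(0).
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, n_pulls, t, c=0):
    # Largest q >= mean with n_pulls * d(mean, q) <= ln t + c ln ln t.
    # d(mean, q) is increasing in q on [mean, 1], so bisection works.
    bound = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / n_pulls
    lo, hi = mean, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

idx = kl_ucb_index(0.5, 100, 1000)
print(idx)   # a little above 0.5: optimism shrinks as n_pulls grows
```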
Thompson Sampling:
• Thompson Sampling is a Bayesian approach to the bandit problem. For each arm:
• Maintain a posterior distribution over the expected reward of that arm.
• At each step, sample from the posterior distribution of each arm.
• Choose the arm with the highest sample.
• The key to Thompson Sampling is the updating of the posterior distribution based on observed rewards. It
naturally balances exploration and exploitation through the stochastic nature of the sampling process.
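The three steps above can be sketched with the standard Beta-Bernoulli model (a Beta(1, 1) prior per arm; the arm means are made up):

```python
import random

random.seed(4)
means = [0.3, 0.5, 0.8]   # hypothetical true arm means, hidden from the agent

def pull(arm):
    return 1 if random.random() < means[arm] else 0

def thompson(T):
    # Beta(1, 1) prior on each arm's mean; after a Bernoulli reward r,
    # the posterior update is simply alpha += r, beta += 1 - r.
    k = len(means)
    alpha = [1] * k
    beta = [1] * k
    n = [0] * k
    for _ in range(T):
        # Sample one value from each arm's posterior ...
        samples = [random.betavariate(alpha[i], beta[i]) for i in range(k)]
        # ... and play the arm with the highest sample.
        arm = samples.index(max(samples))
        r = pull(arm)
        alpha[arm] += r
        beta[arm] += 1 - r
        n[arm] += 1
    return n

counts = thompson(5000)
print(counts)   # the best arm (index 2) dominates as posteriors sharpen
```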
THANK YOU