Introduction to Reinforcement Learning
Mrs.Harsha Patil,Dr.D.Y.Patil ACS College,Pimpri,Pune
What is Reinforcement Learning?
 Reinforcement Learning is a feedback-based machine learning technique in
which an agent learns to behave in an environment by performing actions
and observing the results of those actions. For each good action, the agent
gets positive feedback, and for each bad action, the agent gets negative
feedback or a penalty.
 In Reinforcement Learning, the agent learns automatically from feedback,
without any labeled data, unlike supervised learning.
 Since there is no labeled data, the agent is bound to learn from its
experience alone.
 RL solves a specific type of problem where decision making is sequential
and the goal is long-term, such as game playing, robotics, etc.
 The agent interacts with the environment and explores it by itself. The
primary goal of an agent in reinforcement learning is to improve its
performance by collecting the maximum positive reward.
 The agent learns through a process of trial and error, and based on this
experience, it learns to perform the task in a better way. Hence, we can say
that "Reinforcement learning is a type of machine learning method where
an intelligent agent (computer program) interacts with the environment
and learns to act within it." How a robotic dog learns the movement of
its arms is an example of Reinforcement learning.
 It is a core part of Artificial Intelligence, and many AI agents work on the
concept of reinforcement learning. Here we do not need to pre-program the
agent, as it learns from its own experience without any human intervention.
 Example: Suppose there is an AI agent present within a maze environment,
and its goal is to find the diamond. The agent interacts with the environment
by performing some actions, and based on those actions, the state of the
agent changes, and it also receives a reward or penalty as feedback.
 The agent keeps doing these three things (take an action, change
state or remain in the same state, and get feedback), and by doing so,
it learns and explores the environment.
 The agent learns which actions lead to positive feedback (rewards) and
which actions lead to negative feedback (penalty). As a positive reward, the
agent gets a positive point, and as a penalty, it gets a negative point.
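This interaction loop can be written down compactly. The sketch below is only an illustration, assuming a generic environment object with reset/step methods and a policy function; none of these names come from the slides.

```python
def run_episode(env, policy, max_steps=100):
    """One episode of the agent-environment loop: act, observe the new state,
    collect the reward or penalty, and stop when the episode ends."""
    state = env.reset()                          # initial state from the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # agent chooses an action
        state, reward, done = env.step(action)   # environment returns new state and feedback
        total_reward += reward
        if done:
            break
    return total_reward
```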
Terms used in Reinforcement Learning
 Agent: An entity that can perceive/explore the environment and act upon it.
 Environment: The situation in which the agent is present or by which it is surrounded. In RL,
we assume a stochastic environment, which means it is random in nature.
 Action: Actions are the moves taken by the agent within the environment.
 State: The situation returned by the environment after each action taken
by the agent.
 Reward: Feedback returned to the agent from the environment to evaluate the
action of the agent.
 Policy: A strategy applied by the agent to decide the next action based on
the current state.
 Value: The expected long-term return with the discount factor, as opposed to
the short-term reward.
 Q-value: Mostly similar to the value, but it takes one additional parameter,
the current action (a).
Types of Reinforcement learning :
There are mainly two types of reinforcement learning, which are:
Positive Reinforcement:
 Positive reinforcement means adding something to increase the
tendency that the expected behavior will occur again. It impacts the behavior
of the agent positively and increases the strength of the behavior.
 This type of reinforcement can sustain changes for a long time, but too
much positive reinforcement may lead to an overload of states, which can
diminish the results.
Negative Reinforcement:
 Negative reinforcement is the opposite of positive reinforcement,
as it increases the tendency that a specific behavior will occur again by
avoiding a negative condition.
 It can be more effective than positive reinforcement depending on the
situation and behavior, but it provides reinforcement only up to the minimum
required behavior.
Applications of Reinforcement Learning :
 RL is mainly used to train robots for industrial automation.
 Business strategy planning is also decided with RL. Bonsai is one of several
startups building tools to enable companies to use RL and other techniques
for industrial applications.
 Education and training: Online platforms are beginning to experiment with
machine learning to create personalized experiences. Several researchers
are exploring the use of RL and other machine learning methods in tutoring
systems and personalized learning. The use of RL can lead to training
systems that provide custom instruction and materials tuned to the needs of
individual students. A group of researchers is developing RL algorithms
and statistical methods that require less data for use in future tutoring
systems.
 Machine learning and data processing: AutoML from Google uses RL to
produce state-of-the-art machine-generated neural network architectures for
computer vision and language modelling.
 Health and medicine: The RL setup, an agent interacting with an
environment and receiving feedback based on the actions taken, shares similarities
with the problem of learning treatment policies in the medical sciences. In
fact, many RL applications in health care relate to finding optimal
treatment policies. Recent papers mention applications of RL to the usage of
medical equipment, medication dosing, and two-stage clinical trials.
 Aircraft control and robot motion control
 Text, speech, and dialog systems: Companies collect a lot of text, and good
tools that can help unlock unstructured text will find users. AI researchers at
SalesForce used deep RL for abstractive text summarization (a technique
for automatically generating summaries from text based on content
“abstracted” from some original text document). This could be an area where
RL-based tools gain new users, as many companies are in need of better
text mining solutions.
 RL is also being used to allow dialog systems (i.e., chatbots) to learn from
user interactions and thus help them improve over time (many enterprise
chatbots currently rely on decision trees). This is an active area of research
and VC investments: see Semantic Machines and VocalIQ—acquired by
Apple.
 Media and advertising: Microsoft recently described an internal system
called Decision Service that has since been made available on Azure. This
paper describes applications of Decision Service to content recommendation
and advertising. Decision Service more generally targets machine learning
products that suffer from failure modes including “feedback loops and bias,
distributed data collection, changes in the environment, and weak monitoring
and debugging.”
 Other applications of RL include cross-channel marketing optimization and
real-time bidding systems for online display advertising.
 Finance: A Financial Times article described an RL-based system for
optimal trade execution. The system (dubbed “LOXM”) is being used to
execute trading orders at maximum speed and at the best possible price.
 There are two important learning models in reinforcement learning:
1. Markov Decision Process (MDP):
 Most reinforcement learning problems can be formalized using an MDP.
In an MDP, the agent constantly interacts with the environment and performs
actions. At each action, the environment responds and generates a
new state.
 An MDP is defined by a tuple of four elements (S, A, Pa, Ra):
 A finite set of States S
 A finite set of Actions A
 Rewards Ra received after transitioning from state S to state S' due to action a
 Transition probabilities Pa of moving from state S to state S' under action a
 MDP uses the Markov property, which states: "If the agent is in the
current state S1 and by performing some action a1 moves to state S2,
then the state transition from S1 to S2 depends only on the current state and
the chosen action. The next state does not depend on past actions, rewards, or
states."
 In other words, the current state transition does not depend on any past action
or state. Hence, in an RL problem that satisfies the Markov property, such as a
game of chess, the players focus only on the current state and do not
need to remember past actions or states.
 In RL, we consider only finite MDPs, with finite states, finite rewards,
and finite actions.
 A Markov process is therefore a memoryless process with a sequence of random
states S1, S2, ....., St that satisfies the Markov property. A Markov process is also
known as a Markov chain, which is a tuple (S, P) of a state set S and a transition function
P. These two components (S and P) define the dynamics of the system.
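As a concrete illustration of the (S, A, Pa, Ra) tuple, the short sketch below encodes a toy finite MDP as plain dictionaries. The states, actions, probabilities, and rewards are invented for illustration only; the key point is that the next state sampled in step() depends only on the current state and action, which is exactly the Markov property.

```python
import random

S = ["s1", "s2"]                      # finite set of states
A = ["a1", "a2"]                      # finite set of actions

# Pa: for each (state, action), a probability distribution over next states.
P = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 1.0},
    ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a2"): {"s2": 1.0},
}

# Ra: reward received for a transition; unspecified transitions give 0.
R = {("s1", "a1", "s2"): 1.0}

def step(state, action):
    """Sample the next state and reward; both depend only on (state, action)."""
    dist = P[(state, action)]
    next_state = random.choices(list(dist), weights=list(dist.values()))[0]
    return next_state, R.get((state, action, next_state), 0.0)
```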
2. Q-Learning :
 The Q stands for quality in Q-learning, which means it specifies the quality of an
action taken by the agent. It is a value-based method that supplies information
about which action an agent should take.
• The Q-learning algorithm is a form of temporal-difference learning. Temporal-
difference learning methods compare temporally successive predictions.
• It learns the value function Q(s, a), which indicates how good it is to take action "a"
in a particular state "s".
• The flowchart below explains the working of Q-learning:
Fig.: Q-learning flowchart
 A Q-table or matrix is created while performing Q-learning. The table
is indexed by state-action pairs [s, a], and the values are initialized to zero.
After each action, the table is updated, and the q-values are stored within
the table.
 The RL agent uses this Q-table as a reference table to select the best action
based on the q-values.
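A minimal sketch of the Q-table update is shown below. It assumes the maze's nine blocks are numbered 0-8 and uses a standard temporal-difference update rule; the learning rate, discount factor, and state numbering are assumptions for illustration, not values given in the slides.

```python
import numpy as np

n_states, n_actions = 9, 4            # 9 maze blocks; actions: up, down, left, right
Q = np.zeros((n_states, n_actions))   # Q-table, initialized to zero

alpha, gamma = 0.1, 0.9               # learning rate and discount factor (assumed)

def q_update(s, a, r, s_next):
    """One temporal-difference update of Q(s, a) after observing reward r and next state s_next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# After enough interaction, the best action in state s is np.argmax(Q[s]),
# which is how the agent uses the Q-table as a reference table.
```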
 To understand the concept, consider an example of a maze environment
that the agent needs to explore, as shown in the following figure:
Fig.: Maze Game
 In the above image, the agent (blue ball) is at the very first block
(S9) of the maze. The maze contains the S6 block, which is a wall (shown by
the red box), S8, which is a pit, and S4, the target block (which the agent has to reach).
 The agent cannot cross the S6 block, as it is a solid wall. If the agent
reaches the S4 block, it gets a +1 reward; if it reaches the pit, it
gets a -1 reward. It can take four actions: move up, move down, move
left, and move right.
 The agent can take any path to reach the final point, but it needs to
do so in as few steps as possible. Suppose the agent follows the path S9-
S5-S1-S2-S3; it will then get the +1 reward.
 The agent will try to remember the preceding steps it has taken to reach
the final block. To memorize the steps, it assigns the value 1 to each previous
block:
Fig.: Maze Game
 Now, the agent has successfully stored the previous steps by assigning the value 1
to each previous block. But what will the agent do if it starts moving
from a block that has blocks with value 1 on both sides? As shown in the
following diagram:
Fig.: Maze Game
 It will be difficult for the agent to decide whether it should go up or down,
as each block has the same value. So, the above approach is not suitable
for the agent to reach the destination. Hence, to solve this problem, we
use the Bellman equation, which is the main concept behind reinforcement
learning.
 For any action, the agent will get a reward R(s, a) and will
end up in a certain state, so the value equation is given by the Bellman
equation:
V(s) = max [R(s,a) + γV(s`)]
Where,
V(s) = value calculated at a particular state.
R(s,a) = reward obtained at state s by performing action a.
γ = discount factor
V(s`) = the value of the next state.
The key elements used in the Bellman equation are:
 The action performed by the agent is referred to as "a".
 The state that results from performing the action is "s".
 The reward/feedback obtained for each good and bad action is "R".
 The discount factor is gamma "γ".
 In the above equation, we take the max over the possible actions
because the agent always tries to find the optimal solution.
 So now, using the Bellman equation, we will find the value of each state of the
given environment. We start from the block that is next to the target
block.
For 1st block:
V(s3) = max [R(s,a) + γV(s`)], here V(s') = 0 because there is no further state
to move to.
V(s3) = max[R(s,a)] => V(s3) = max[1] => V(s3) = 1.
For 2nd block:
V(s2) = max [R(s,a) + γV(s`)], here γ = 0.9 (assumed), V(s') = 1, and R(s, a) = 0,
because there is no reward at this state.
V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9
For 3rd block:
V(s1) = max [R(s,a) + γV(s`)], here γ = 0.9 (assumed), V(s') = 0.9, and R(s, a) = 0,
because there is no reward at this state either.
V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81
For 4th block:
V(s5) = max [R(s,a) + γV(s`)], here γ = 0.9 (assumed), V(s') = 0.81, and R(s, a) = 0,
because there is no reward at this state either.
V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) ≈ 0.73
For 5th block:
V(s9) = max [R(s,a) + γV(s`)], here γ = 0.9 (assumed), V(s') = 0.73, and R(s, a) = 0,
because there is no reward at this state either.
V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) ≈ 0.66
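The same backward calculation can be checked with a few lines of code. This is only a verification sketch of the hand computation above; the rounding to two decimals matches the values shown in the slides.

```python
gamma = 0.9                                # discount factor, as assumed above
path = ["S3", "S2", "S1", "S5", "S9"]      # blocks from nearest-to-target backwards

V = {"S3": 1.0}                            # moving from S3 into the target yields R = +1
for prev, cur in zip(path, path[1:]):
    V[cur] = round(gamma * V[prev], 2)     # R(s, a) = 0 elsewhere, so V(s) = gamma * V(s')

print(V)   # {'S3': 1.0, 'S2': 0.9, 'S1': 0.81, 'S5': 0.73, 'S9': 0.66}
```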
 Now, consider the below image of the maze environment:
 Moving further, at the S7 block the agent may change its
route because it always tries to find the optimal path. The agent has three
options to move: if it moves towards the red box, it will bump into the wall; if it
moves towards the pit, it will get the -1 reward. But since we are considering only
positive rewards, it will move upwards only and then to the target
box, i.e., S4, and thus finally reach the target. The complete block values
are calculated using this formula, as shown in the following figure.
Fig.: Maze Game
Thompson Sampling :
 Thompson Sampling is an algorithm that follows exploration and exploitation to
maximize the cumulative rewards obtained by performing an action. Thompson
Sampling is also sometimes referred to as Posterior Sampling or Probability
Matching.
 Performing an action multiple times is called exploration, and, based on
the results obtained from those actions (either rewards or penalties), performing
further actions with the goal of maximizing the reward is called
exploitation. In other words, new choices are explored to maximize rewards
while the already explored choices are exploited.
 One of the first and best examples used to explain the Thompson Sampling
method is the Multi-Armed Bandit problem.
Steps of the algorithm:
Consider our own version of a multi-armed bandit problem.
Suppose we have a dataset from an online advertising campaign in which an
advertisement has 10 different versions. Our task is to find the best version
using the Thompson Sampling algorithm.
Step 1. At each round n, we consider two numbers for each advertisement i:
N’i(n): the number of times advertisement i got reward 1 up to round n,
N”i(n): the number of times advertisement i got reward 0 up to round n.
Step 2. For each advertisement i, take a random draw from the Beta distribution below:
θi(n) = β(N’i(n) + 1 , N”i(n) + 1)
Step 3. Select the advertisement that has the maximum θi(n).
In the first step, the total number of rewards (either 0 or 1) is counted.
To understand the second step, we need to look at the Bayesian inference
rule, as follows:
Bayesian Inference Rule:
 Advertisement i gets reward y from a Bernoulli distribution p(y|θi) = B(θi).
 θi is unknown, but we express its uncertainty by assuming it has a uniform distribution
p(θi) ~ U([0,1]), which is the prior distribution.
 Bayes Rule: we approach θi by the posterior distribution
p(θi|y) ∝ p(y|θi) p(θi).
 We get p(θi|y) = β(number of successes + 1, number of failures + 1).
 At each round n, we take a random draw θi(n) from this posterior distribution
p(θi|y), for each advertisement i.
 At each round n we select the advertisement i that has the highest θi(n).
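The three steps above translate almost directly into code. The sketch below is a minimal illustration, assuming the per-advertisement success and failure counts from Step 1 are kept in two lists; the function names are invented for this example.

```python
import random

n_ads = 10
successes = [0] * n_ads     # N'_i(n): times version i returned reward 1
failures = [0] * n_ads      # N''_i(n): times version i returned reward 0

def select_ad():
    """Steps 2 and 3: draw theta_i ~ Beta(N'_i + 1, N''_i + 1) and pick the largest."""
    draws = [random.betavariate(successes[i] + 1, failures[i] + 1)
             for i in range(n_ads)]
    return max(range(n_ads), key=lambda i: draws[i])

def record(ad, reward):
    """Step 1 bookkeeping: update the counts with the observed 0/1 reward."""
    if reward == 1:
        successes[ad] += 1
    else:
        failures[ad] += 1
```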
Applications Of Thompson Sampling:
The Thompson Sampling algorithm has been around for a long time. It was first
proposed in 1933 by William R. Thompson. Though the algorithm was largely
ignored for many years after its proposal, it has recently gained traction
in many areas of Artificial Intelligence. Some of the areas where it has been
used are revenue management, marketing, website optimization, A/B testing,
advertisement, recommendation systems, gaming, and more.
Upper Confidence Bound (UCB) :
 Unlike Thompson Sampling, which is a probabilistic algorithm in which the
success-rate distribution of each bandit is estimated by sampling from a
probability distribution, UCB is a deterministic algorithm: there is no factor
of uncertainty or randomness in its choices.
 UCB is a deterministic algorithm for Reinforcement Learning that
focuses on exploration and exploitation based on a confidence
bound that the algorithm assigns to each machine on each round
of exploration. (A round is when a player pulls the arm of a machine.)
Consider 5 bandits or slot machines, namely B1, B2, B3, B4 and B5.
Given the 5 machines, using UCB we are going to devise a sequence of playing
the machines so as to maximize the yield or rewards from them.
Given below are the steps behind UCB for maximizing the rewards in a multi-armed bandit problem (MABP):
Step 1: Each machine is assumed to have a uniform confidence interval and a
success distribution. This confidence interval is a margin of success rates
that is most likely to contain the actual success rate
of each machine, which we do not know at the beginning.
Step 2: A machine is chosen at random to play, as initially they all have the same
confidence interval.
Step 3: Based on whether the machine gave a reward or not, the confidence
interval shifts either towards or away from the actual success distribution,
and it also converges (shrinks) as the machine is explored, which
reduces the upper bound of the confidence interval.
Step 4: Based on the current upper confidence bounds of each of the
machines, the one with the highest bound is chosen to explore in the next round.
Step 5: Steps 3 and 4 are continued until there are sufficient observations to
determine the upper confidence bound of each machine. The one with the
highest upper confidence bound is the machine with the highest success
rate.
Given below is the algorithm inside UCB that updates the confidence bound
of each machine after each round.
Step 1: Two values are tracked for each round of exploration of a machine:
 Ni(n): the number of times machine i has been selected up to round n
 Ri(n): the sum of rewards collected by machine i up to round n
Step 2: At each round n, we compute the average reward and the confidence
interval of machine i up to round n as follows:
Average reward: r̄i(n) = Ri(n) / Ni(n)
Confidence interval: Δi(n) = sqrt( 3 log(n) / (2 Ni(n)) ) (one standard choice for the confidence term)
Step 3: The machine with the maximum UCB is selected, where
UCB: UCBi(n) = r̄i(n) + Δi(n)
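Putting the two routines together, the sketch below selects a machine by its upper confidence bound and updates the counts after each round. The confidence term mirrors the Step 2 formula above; the exact constant and the function names are assumptions for illustration.

```python
import math

n_machines = 5
counts = [0] * n_machines      # N_i(n): times each machine has been selected
rewards = [0.0] * n_machines   # R_i(n): sum of rewards collected by each machine

def select_machine(n):
    """Return the machine with the highest upper confidence bound at round n."""
    best, best_ucb = 0, float("-inf")
    for i in range(n_machines):
        if counts[i] == 0:
            return i                                      # play every machine once first
        avg = rewards[i] / counts[i]                      # average reward
        delta = math.sqrt(1.5 * math.log(n) / counts[i])  # confidence interval
        if avg + delta > best_ucb:
            best, best_ucb = i, avg + delta
    return best

def record(machine, reward):
    counts[machine] += 1
    rewards[machine] += reward
```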
Difference between Reinforcement Learning and Supervised Learning :
 RL works by interacting with the environment, whereas supervised learning works on an existing dataset.
 The RL algorithm works the way the human brain works when making decisions, whereas supervised learning works the way a human learns under the supervision of a guide.
 In RL there is no labeled dataset, whereas in supervised learning a labeled dataset is present.
 No previous training is provided to the RL agent, whereas in supervised learning training is provided to the algorithm so that it can predict the output.
 RL takes decisions sequentially, whereas in supervised learning decisions are made when the input is given.
Thanks !!!