What is Reinforcement Learning?
Reinforcement Learning is a feedback-based machine learning technique in
which an agent learns to behave in an environment by performing actions
and observing the results of those actions. For each good action, the agent
receives positive feedback, and for each bad action, it receives negative
feedback or a penalty.
In Reinforcement Learning, the agent learns automatically from feedback,
without any labeled data, unlike supervised learning.
Since there is no labeled data, the agent is bound to learn from its
experience alone.
RL solves a specific type of problem where decision making is sequential,
and the goal is long-term, such as game-playing, robotics, etc.
The agent interacts with the environment and explores it on its own. The
primary goal of the agent in reinforcement learning is to improve its
performance by collecting the maximum positive reward.
The agent learns through trial and error, and based on that experience, it
learns to perform the task in a better way. Hence, we can say that
"Reinforcement learning is a type of machine learning method where an
intelligent agent (computer program) interacts with the environment and
learns to act within it." How a robotic dog learns the movement of its
limbs is an example of Reinforcement Learning.
It is a core part of Artificial Intelligence, and many AI agents work on the
concept of reinforcement learning. Here we do not need to pre-program the
agent, as it learns from its own experience without any human intervention.
Example: Suppose there is an AI agent present within a maze environment,
and its goal is to find the diamond. The agent interacts with the environment
by performing some actions; based on those actions, the state of the agent
changes, and it also receives a reward or penalty as feedback.
The agent continues doing these three things (take an action, change state
or remain in the same state, and get feedback), and by repeating them, it
learns and explores the environment.
The agent learns which actions lead to positive feedback or rewards and
which actions lead to negative feedback or penalties. For a positive reward,
the agent gets a positive point, and as a penalty, it gets a negative point.
Terms used in Reinforcement Learning
Agent: An entity that can perceive/explore the environment and act upon it.
Environment: The situation in which the agent is present or by which it is
surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
Action: The moves taken by the agent within the environment.
State: The situation returned by the environment after each action taken
by the agent.
Reward: Feedback returned to the agent from the environment to evaluate the
agent's action.
Policy: The strategy the agent applies to choose the next action based on
the current state.
Value: The expected long-term return with discounting, as opposed to
the short-term reward.
Q-value: Mostly similar to the value, but it takes one additional parameter,
the current action (a).
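To make these terms concrete, below is a minimal Python sketch (the environment, its states and rewards, and the random policy are invented purely for illustration) showing an agent, an environment, a policy, and the reward feedback loop:

import random

# A tiny, made-up environment: states 0..4, the agent tries to reach state 4.
class Environment:
    def __init__(self):
        self.state = 0                                  # current state

    def step(self, action):                             # action: -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        reward = 1 if self.state == 4 else 0            # reward returned as feedback
        done = (self.state == 4)
        return self.state, reward, done

def random_policy(state):                               # policy: maps a state to an action
    return random.choice([-1, +1])

env = Environment()                                     # the agent explores this environment
state, total_reward, done = env.state, 0, False
while not done:
    action = random_policy(state)                       # agent takes an action
    state, reward, done = env.step(action)              # environment returns new state + reward
    total_reward += reward
print("total reward collected:", total_reward)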
Types of Reinforcement Learning:
There are mainly two types of reinforcement learning:
Positive Reinforcement:
Positive reinforcement means adding something to increase the tendency
that the expected behavior will occur again. It has a positive impact on
the agent's behavior and increases the strength of that behavior.
This type of reinforcement can sustain changes for a long time, but too
much positive reinforcement may lead to an overload of states, which can
diminish the results.
Negative Reinforcement:
Negative reinforcement is the opposite of positive reinforcement: it
increases the tendency that a specific behavior will occur again by
avoiding a negative condition.
It can be more effective than positive reinforcement, depending on the
situation and behavior, but it provides only enough reinforcement to meet
the minimum required behavior.
Applications of Reinforcement Learning:
Industrial automation: RL is mainly used to train robots for industrial
automation. Business strategy planning can also be guided by RL. Bonsai is one of several
startups building tools to enable companies to use RL and other techniques
for industrial applications.
Education and training: Online platforms are beginning to experiment with
machine learning to create personalized experiences. Several researchers
are exploring the use of RL and other machine learning methods in tutoring
systems and personalized learning. The use of RL can lead to training
systems that provide custom instruction and materials tuned to the needs of
individual students. A group of researchers is developing RL algorithms
and statistical methods that require less data for use in future tutoring
systems.
Machine learning and data processing: AutoML from Google uses RL to
produce state-of-the-art machine-generated neural network architectures for
computer vision and language modelling.
Health and medicine: The RL setup of an agent interacting with an
environment and receiving feedback based on the actions taken shares
similarities with the problem of learning treatment policies in the medical
sciences. In fact, many RL applications in health care relate to finding
optimal treatment policies. Recent papers describe applications of RL to the
usage of medical equipment, medication dosing, and two-stage clinical trials.
Aircraft control and robot motion control
Text, speech, and dialog systems: Companies collect a lot of text, and good
tools that can help unlock unstructured text will find users. AI researchers at
Salesforce used deep RL for abstractive text summarization (a technique
for automatically generating summaries from text based on content
"abstracted" from the original text document). This could be an area where
RL-based tools gain new users, as many companies are in need of better
text mining solutions.
RL is also being used to allow dialog systems (i.e., chatbots) to learn from
user interactions and thus help them improve over time (many enterprise
chatbots currently rely on decision trees). This is an active area of research
and VC investment: see Semantic Machines and VocalIQ, acquired by
Apple.
Media and advertising: Microsoft recently described an internal system
called Decision Service that has since been made available on Azure. A
related paper describes applications of Decision Service to content
recommendation and advertising. Decision Service more generally targets
machine learning products that suffer from failure modes including
"feedback loops and bias, distributed data collection, changes in the
environment, and weak monitoring and debugging."
Other applications of RL include cross-channel marketing optimization and
real-time bidding systems for online display advertising.
Finance: A Financial Times article described an RL-based system for
optimal trade execution. The system (dubbed "LOXM") is being used to
execute trading orders at maximum speed and at the best possible price.
There are two important learning models in reinforcement learning:
1. Markov Decision Process (MDP):
Most reinforcement learning problems can be formalized using an MDP.
In an MDP, the agent constantly interacts with the environment and performs
actions; at each action, the environment responds and generates a
new state.
An MDP is a tuple of four elements (S, A, Pa, Ra):
A finite set of states S
A finite set of actions A
Ra: the reward received after transitioning from state s to state s' due to action a
Pa: the probability that action a in state s leads to state s'
MDP uses the Markov property, which states: "If the agent is present in the
current state S1 and, by performing some action a1, moves to the state S2,
then the state transition from S1 to S2 depends only on the current state and
the chosen action. The next state does not depend on past actions, rewards, or
states."
In other words, the current state transition does not depend on any past action
or state. Hence, in an RL problem that satisfies the Markov property, such as a
game of chess, the players only need to focus on the current state and do not
need to remember past actions or states.
In RL, we consider only finite MDPs, in which the states, rewards, and actions
are all finite.
A Markov process is therefore a memoryless process with a sequence of random
states S1, S2, ....., St that satisfies the Markov property. A Markov process is also
known as a Markov chain, which is a tuple (S, P) of a state set S and a transition
function P. These two components (S and P) define the dynamics of the system.
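As an illustration, the (S, A, Pa, Ra) tuple of a small, hypothetical MDP can be written out directly in Python; the states, transition probabilities, and rewards below are invented for the sketch:

import random

S = ["s1", "s2"]                               # finite set of states
A = ["stay", "move"]                           # finite set of actions

# P[(s, a)] holds the next-state probabilities Pa(s' | s, a)
P = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s2": 0.8, "s1": 0.2},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "move"): {"s1": 1.0},
}

# R[(s, a, s')] is the reward received after transitioning from s to s' due to a
R = {("s1", "move", "s2"): 1.0}

def transition(state, action):
    """Sample the next state and reward using only the current state and action:
    the Markov property -- past states, actions, and rewards are irrelevant."""
    candidates = list(P[(state, action)].keys())
    probs = [P[(state, action)][s2] for s2 in candidates]
    next_state = random.choices(candidates, weights=probs, k=1)[0]
    return next_state, R.get((state, action, next_state), 0.0)

print(transition("s1", "move"))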
2. Q-Learning:
The Q in Q-learning stands for quality: it specifies the quality of an
action taken by the agent. It is a value-based method that supplies
information about which action an agent should take.
• The Q-learning algorithm is a form of temporal-difference learning. Temporal-difference
learning methods work by comparing temporally successive predictions.
• It learns the value function Q(s, a), which measures how good it is to take action "a"
in a particular state "s".
• The working of Q-learning is typically summarized in a flowchart (figure not shown here).
A Q-table or matrix is created while performing Q-learning. The table is
indexed by state-action pairs [s, a], and all values are initialized to zero.
After each action, the table is updated and the Q-values are stored in it.
The RL agent uses this Q-table as a reference table to select the best action
based on the Q-values.
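A minimal sketch of this Q-table idea, assuming a small environment with 9 numbered states and 4 actions; the learning rate, discount factor, and exploration rate are arbitrary choices for illustration, not values given in the text above:

import numpy as np

n_states, n_actions = 9, 4                       # e.g. the 9 maze blocks and 4 moves
Q = np.zeros((n_states, n_actions))              # Q-table indexed by [s, a], initialized to zero

alpha, gamma, epsilon = 0.1, 0.9, 0.1            # learning rate, discount factor, exploration rate

def choose_action(state):
    """Select the best action from the Q-table, exploring occasionally."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state):
    """Temporal-difference (Q-learning) update of the table entry [s, a]."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])

After enough interactions, the agent simply reads np.argmax(Q[state]) to pick the best action in each state.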
To understand the concept, consider the example of a maze environment
that the agent needs to explore, as shown in the following figure:
Fig.: Maze Game
In the above image, the agent (blue ball) is at the very first block (S9) of the
maze. The maze contains block S6, which is a wall (shown by the red box),
S8, which is a pit, and S4, the target block that the agent has to reach.
The agent cannot cross the S6 block, as it is a solid wall. If the agent
reaches the S4 block, it gets a +1 reward; if it reaches the pit, it gets a -1
reward. It can take four actions: move up, move down, move left, and move right.
The agent can take any path to reach the final point, but it needs to do so in
as few steps as possible. Suppose the agent follows the path S9-S5-S1-S2-S3;
it will then get the +1 reward point.
The agent will try to remember the preceding steps it has taken to reach
the final step. To memorize the steps, it assigns a value of 1 to each previous
step:
Now the agent has successfully stored the previous steps by assigning the
value 1 to each previous block. But what will the agent do if it starts from a
block that has blocks with value 1 on both sides? As shown in the
following diagram:
Fig.: Maze Game
It will be difficult for the agent to decide whether it should go up or down,
as each block has the same value. So the above approach is not suitable
for the agent to reach the destination. Hence, to solve the problem, we use
the Bellman equation, which is the main concept behind reinforcement
learning.
On performing an action, the agent receives a reward R(s, a) and ends up
in a certain state, so the value equation given by the Bellman equation is:
V(s) = max[R(s,a) + γV(s')]
Where,
V(s) = value calculated at a particular state.
R(s,a) = reward obtained in state s by performing action a.
γ = discount factor.
V(s') = value of the next state.
The key elements used in the Bellman equation are:
The action performed by the agent is referred to as "a".
The state reached by performing the action is "s".
The reward/feedback obtained for each good or bad action is "R".
The discount factor is gamma "γ".
In the above equation, we take the maximum over the possible actions
because the agent always tries to find the optimal solution.
So now, using the Bellman equation, we will find the value at each state of the
given environment. We will start from the block that is next to the target
block.
For the 1st block:
V(s3) = max[R(s,a) + γV(s')], here V(s') = 0 because there is no further state
to move to.
V(s3) = max[R(s,a)] => V(s3) = max[1] => V(s3) = 1.
For the 2nd block:
V(s2) = max[R(s,a) + γV(s')], here γ = 0.9 (assumed), V(s') = 1, and R(s,a) = 0,
because there is no reward at this state.
V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9
For the 3rd block:
V(s1) = max[R(s,a) + γV(s')], here γ = 0.9 (assumed), V(s') = 0.9, and R(s,a) = 0,
because there is no reward at this state either.
V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81
For the 4th block:
V(s5) = max[R(s,a) + γV(s')], here γ = 0.9 (assumed), V(s') = 0.81, and R(s,a) = 0,
because there is no reward at this state either.
V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) ≈ 0.73
For the 5th block:
V(s9) = max[R(s,a) + γV(s')], here γ = 0.9 (assumed), V(s') = 0.73, and R(s,a) = 0,
because there is no reward at this state either.
V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) ≈ 0.66
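The same backward calculation can be reproduced in a few lines of Python; the path, the single +1 reward at the block next to the target, and γ = 0.9 match the worked example above:

# Blocks along the chosen path, listed from the block next to the target
# back to the start: s3, s2, s1, s5, s9.
path = ["s3", "s2", "s1", "s5", "s9"]
gamma = 0.9

V, v_next = {}, 0.0                       # no further state beyond s3
for i, s in enumerate(path):
    reward = 1.0 if i == 0 else 0.0       # only the block next to the target is rewarded
    V[s] = reward + gamma * v_next        # Bellman backup: V(s) = max[R(s,a) + γV(s')]
    v_next = V[s]

for s in path:
    print(s, round(V[s], 2))              # s3 1.0, s2 0.9, s1 0.81, s5 0.73, s9 0.66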
Now, consider the below image of the maze environment:
We will now move further to the S7 block, and here the agent may change the
route because it always tries to find the optimal path. The agent has three
options to move: if it moves to the red box, it will bump into the wall; if it
moves to the pit, it will get the -1 reward. But here we are considering only
positive rewards, so the agent will move upwards only and then to the target
box, i.e., S4, and thus finally reach the target. The complete block values
calculated using this formula are shown in the following figure.
Thompson Sampling:
Thompson Sampling is an algorithm that combines exploration and exploitation to
maximize the cumulative reward obtained by performing actions. Thompson
Sampling is also sometimes referred to as Posterior Sampling or Probability
Matching.
An action is performed multiple times, which is called exploration, and based on
the results obtained from those actions (rewards or penalties), further
actions are performed with the goal of maximizing the reward, which is called
exploitation. In other words, new choices are explored to maximize rewards
while the already explored choices are exploited.
One of the first and best examples used to explain the Thompson Sampling
method is the Multi-Armed Bandit problem.
Steps of the algorithm:
The steps are illustrated using our own version of a multi-armed bandit problem.
Suppose we have a dataset from an online advertising campaign in which an
advertisement has 10 different versions. Our task is to find the best version
using the Thompson Sampling algorithm.
Step 1. At each round n, we consider two numbers for each ad i:
N'i(n): the number of times ad i received reward 1 up to round n,
N''i(n): the number of times ad i received reward 0 up to round n.
Step 2. For each ad i, take a random draw from the distribution below:
θi(n) ~ Beta(N'i(n) + 1, N''i(n) + 1)
Step 3. We select the ad that has the maximum θi(n).
In the first step, the total number of rewards (either 0 or 1) is counted.
To understand the second step, we need to look at the Bayesian inference
rules, as follows:
Bayesian Inference Rule:
Ad i gives a reward y drawn from the Bernoulli distribution p(y | θi) ~ Bernoulli(θi).
θi is unknown, but we model its uncertainty by assuming a uniform prior
p(θi) ~ U([0,1]).
Bayes' rule: we approximate θi by its posterior distribution p(θi | y).
This gives p(θi | y) ~ Beta(number of successes + 1, number of failures + 1).
At each round n, we take a random draw θi(n) from this posterior distribution
p(θi | y) for each ad i.
At each round n, we select the ad i that has the highest θi(n).
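A minimal sketch of these three steps, simulating the 10-version ad campaign; the hidden click probabilities below are made up purely to drive the simulation:

import random

true_ctr = [0.05, 0.03, 0.12, 0.07, 0.02, 0.09, 0.04, 0.11, 0.06, 0.08]   # hidden, simulation only
n_ads = len(true_ctr)
n_success = [0] * n_ads       # N'i(n): times ad i received reward 1
n_failure = [0] * n_ads       # N''i(n): times ad i received reward 0

for n in range(10000):
    # Step 2: draw theta_i(n) from Beta(N'i(n) + 1, N''i(n) + 1) for every ad
    theta = [random.betavariate(n_success[i] + 1, n_failure[i] + 1) for i in range(n_ads)]
    # Step 3: show the ad with the largest draw
    i = theta.index(max(theta))
    reward = 1 if random.random() < true_ctr[i] else 0      # simulated user click
    # Step 1: update the success/failure counts for the chosen ad
    if reward:
        n_success[i] += 1
    else:
        n_failure[i] += 1

plays = [n_success[i] + n_failure[i] for i in range(n_ads)]
print("ad shown most often:", plays.index(max(plays)))       # usually ad 2 (true rate 0.12)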
Applications of Thompson Sampling:
The Thompson Sampling algorithm has been around for a long time; it was first
proposed in 1933 by William R. Thompson. Though the algorithm was
ignored and received little attention for many years after its proposal, it
has recently gained traction in many areas of Artificial Intelligence. Some of
the areas where it has been used are revenue management, marketing, website
optimization, A/B testing, advertising, recommendation systems, gaming, and more.
Upper Confidence Bound (UCB):
Unlike Thompson Sampling, which is a probabilistic algorithm (the success
rate distribution of each bandit is estimated through random draws from a
probability distribution), UCB is a deterministic algorithm, meaning there is
no element of randomness or probabilistic sampling in its choice.
UCB is a deterministic algorithm for Reinforcement Learning that balances
exploration and exploitation based on a confidence boundary that the
algorithm assigns to each machine on each round of exploration. (A round
is when a player pulls the arm of a machine.)
Consider that there are 5 bandits or slot machines, namely B1, B2, B3, B4, and B5.
Given the 5 machines, using UCB we are going to devise a sequence of plays
that maximizes the yield or reward from the machines.
Given below are the steps behind UCB for maximizing the rewards in a MABP:
Step 1: Each machine is assumed to have a uniform confidence interval and a
success distribution. This confidence interval is a margin around the success
rate that is most likely to contain the actual success rate distribution of each
machine, which is unknown at the beginning.
Step 2: A machine is chosen at random to play, as initially they all have the same
confidence interval.
Step 3: Based on whether the machine gave a reward or not, the confidence
interval shifts either towards or away from the actual success distribution,
and it also converges or shrinks as the machine is explored, so the upper
bound of the confidence interval is also reduced.
Step 4: Based on the current upper confidence bounds of each of the
machines, the one with the highest bound is chosen to explore in the next round.
Step 5: Steps 3 and 4 are repeated until there are sufficient observations to
determine the upper confidence bound of each machine. The machine with the
highest upper confidence bound is the one with the highest success
rate.
Given below is the algorithm inside UCB that updates the confidence bound
of each machine after each round.
Step 1: Two values are tracked for each round of exploration of a machine:
Ni(n): the number of times machine i has been selected up to round n
Ri(n): the sum of rewards collected by machine i up to round n
Step 2: At each round, we compute the average reward and the confidence
interval of machine i up to round n as follows:
Average reward: r̄i(n) = Ri(n) / Ni(n)
Confidence interval: Δi(n) = sqrt(2 · log(n) / Ni(n)) (the standard UCB1 bound)
Step 3: The machine with the maximum UCB is selected, where
UCB: UCBi(n) = r̄i(n) + Δi(n)
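A minimal sketch of these update rules, simulating five machines; the hidden success rates are made up, and the sqrt(2·log(n)/Ni(n)) bound is the standard UCB1 choice assumed above:

import math
import random

true_rates = [0.20, 0.50, 0.35, 0.60, 0.45]        # hidden success rates, simulation only
n_machines = len(true_rates)
counts = [0] * n_machines                          # Ni(n): times machine i was selected
rewards = [0.0] * n_machines                       # Ri(n): total reward collected by machine i

for n in range(1, 5001):
    ucb = []
    for i in range(n_machines):
        if counts[i] == 0:
            ucb.append(float("inf"))               # play every machine at least once
        else:
            avg = rewards[i] / counts[i]                        # average reward
            delta = math.sqrt(2 * math.log(n) / counts[i])      # confidence interval
            ucb.append(avg + delta)                             # upper confidence bound
    i = ucb.index(max(ucb))                        # Step 3: machine with the maximum UCB
    reward = 1 if random.random() < true_rates[i] else 0
    counts[i] += 1
    rewards[i] += reward

print("machine played most:", counts.index(max(counts)))        # converges to machine 3 (rate 0.60)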
Difference between Reinforcement Learning and Supervised Learning:
• RL works by interacting with the environment, whereas supervised learning
works on an existing dataset.
• The RL algorithm works the way the human brain works when making
decisions, whereas supervised learning works the way a human learns things
under the supervision of a guide.
• In RL, no labeled dataset is present; in supervised learning, a labeled
dataset is present.
• In RL, no previous training is provided to the learning agent; in supervised
learning, training is provided to the algorithm so that it can predict the output.
• RL helps to make decisions sequentially; in supervised learning, decisions
are made when the input is given.