REINFORCEMENT LEARNING (RL)
SEMINAR REPORT
Submitted by
KALAISRIRAM.S
Reg.No:17TH1222
In partial fulfillment of the requirement for the degree of
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
of
PONDICHERRY UNIVERSITY
DEPARTMENT OF INFORMATION TECHNOLOGY
MANAKULA VINAYAGAR INSTITUTE OF TECHNOLOGY
KALITHEERTHALKUPPAM, PUDUCHERRY- 605 107.
MARCH-21.
MANAKULA VINAYAGAR INSTITUTE OF
TECHNOLOGY
KALITHEERTHAL KUPPAM, PUDUCHERRY- 605 107
DEPARTMENT OF INFORMATION TECHNOLOGY
BONAFIDE CERTIFICATE
This is to certify that the Seminar Report Presentation titled “REINFORCEMENT
LEARNING” is a bonafide record of work done by S.KALAISRIRAM (Reg. No: 17TH1222)
of Seventh Semester B.TECH in INFORMATION TECHNOLOGY for the SEMINAR
REPORT during the academic year 2020-21.
Staff in charge Head of the Department
Table of Contents
INTRODUCTION
PROCESS OF REINFORCEMENT LEARNING
THE BELLMAN EQUATION
TYPES OF REINFORCEMENT LEARNING
MARKOV DECISION PROCESS
 How to represent the agent state?
 1. Markov Decision Process
 2. Markov Property
 3. Markov Process
VARIOUS REINFORCEMENT LEARNING ALGORITHMS
COMPARISON OF REINFORCEMENT LEARNING ALGORITHMS
 1. Associative reinforcement learning
 2. Deep reinforcement learning
 3. Inverse reinforcement learning
 4. Safe Reinforcement Learning
LEARNING METHODS
 Difference between Reinforcement learning and Supervised learning
APPLICATIONS
 Reinforcement Learning Applications
PROS AND CONS
 Pros of Reinforcement Learning
 Cons of Reinforcement Learning
CONCLUSION
REFERENCES
CHAPTER 1
INTRODUCTION
1. What is Reinforcement learning?
• Reinforcement Learning is a feedback-based machine learning
technique in which an agent learns to behave in an environment by
performing actions and observing their results. For each good action,
the agent gets positive feedback, and for each bad action, the agent
gets negative feedback or a penalty.
• In Reinforcement Learning, the agent learns automatically from
feedback without any labeled data, unlike supervised learning. Since
there is no labeled data, the agent is bound to learn from its own
experience.
• RL solves a specific type of problem where decision making is
sequential and the goal is long-term, such as game-playing and
robotics. The agent interacts with the environment and explores it by
itself.
• The primary goal of an agent in reinforcement learning is to improve
its performance by collecting the maximum positive reward.
• The agent learns through trial and error and, based on that
experience, learns to perform the task in a better way.
• Hence, we can say that "Reinforcement learning is a type of
machine learning method where an intelligent agent (computer
program) interacts with the environment and learns to act within
that."
FIG 1
FIG 2
2. Terms used in Reinforcement Learning
• Agent: An entity that can perceive/explore the environment and
act upon it.
• Environment: The situation in which the agent is present or by
which it is surrounded. In RL, we assume a stochastic environment,
which means it is random in nature.
• Action: Actions are the moves taken by the agent within the
environment.
• State: The situation returned by the environment after each
action taken by the agent.
• Reward: Feedback returned to the agent from the environment
to evaluate the agent's action.
• Policy: The strategy applied by the agent to choose the next
action based on the current state.
• Value: The expected long-term return with the discount factor,
as opposed to the short-term reward.
• Q-value: Mostly similar to the value, but it takes one
additional parameter, the current action (a).
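To make these terms concrete, the following is a minimal sketch of the agent-environment loop in Python; the GridEnvironment class, its states, and the random policy are hypothetical stand-ins invented for illustration, not part of any library or of the maze example used later.

```python
import random

class GridEnvironment:
    """A toy 1-D environment: states 0..4, where state 4 is the goal (hypothetical example)."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: -1 (move left) or +1 (move right); reward +1 only when the goal is reached
        self.state = max(0, min(4, self.state + action))
        reward = 1 if self.state == 4 else 0
        done = self.state == 4
        return self.state, reward, done

env = GridEnvironment()
state = env.reset()
total_reward = 0
done = False
while not done:
    action = random.choice([-1, 1])         # a (random) policy chooses an action
    state, reward, done = env.step(action)  # the environment returns a new state and a reward
    total_reward += reward
print("Episode finished with total reward:", total_reward)
```

Even with a random policy, the loop shows every element defined above: the agent acts, the environment responds with a state and a reward, and the episode ends when the goal is reached.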
3. Key Features of Reinforcement Learning
o In RL, the agent is not instructed about the environment and what
actions need to be taken.
o It is based on a trial-and-error process.
o The agent takes the next action and changes states according to the
feedback of the previous action.
o The agent may get a delayed reward.
o The environment is stochastic, and the agent needs to explore it in
order to obtain the maximum positive reward.
CHAPTER 2
PROCESS OF REINFORCEMENT LEARNING
1. Elements of Reinforcement Learning
There are four main elements of Reinforcement Learning, which are
given below:
1) Policy: A policy defines the way an agent behaves at a given time. It
maps the perceived states of the environment to the actions to be taken
in those states. The policy is the core element of RL, as it alone can
define the behavior of the agent. In some cases it may be a simple
function or a lookup table, whereas in other cases it may involve
general computation such as a search process. A policy can be
deterministic or stochastic (a small sketch of both follows this list):
For a deterministic policy: a = π(s)
For a stochastic policy: π(a | s) = P[At = a | St = s]
2) Reward Signal: The goal of reinforcement learning is defined by the
reward signal. At each state, the environment sends an immediate signal
to the learning agent, and this signal is known as a reward signal. These
rewards are given according to the good and bad actions taken by the
agent. The agent's main objective is to maximize the total reward
received for good actions. The reward signal can change the policy: if
an action selected by the agent leads to a low reward, the policy may
change to select other actions in the future.
3) Value Function: The value function gives information about how
good a situation or action is and how much reward the agent can
expect. A reward is the immediate signal for each good or bad action,
whereas the value function estimates how good a state or action is in
the long run. The value function depends on the reward, as without
reward there could be no value. The goal of estimating values is to
achieve more reward.
4) Model: The last element of reinforcement learning is the model,
which mimics the behavior of the environment. With the help of the
model, one can make inferences about how the environment will behave.
For example, if a state and an action are given, a model can predict the
next state and reward.
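To illustrate the deterministic and stochastic policies described under element 1) above, here is a small Python sketch; the states, actions, and probabilities are made up purely for illustration.

```python
import random

# Deterministic policy: a = π(s), one fixed action per state (hypothetical states/actions)
deterministic_policy = {"s1": "up", "s2": "down"}

# Stochastic policy: π(a | s) = P[A_t = a | S_t = s], a probability distribution over actions
stochastic_policy = {
    "s1": {"up": 0.8, "down": 0.2},
    "s2": {"up": 0.3, "down": 0.7},
}

def act(policy, state, stochastic=False):
    """Return an action for `state`, either the fixed one or a sampled one."""
    if not stochastic:
        return policy[state]
    probs = policy[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(act(deterministic_policy, "s1"))                 # always "up"
print(act(stochastic_policy, "s1", stochastic=True))   # "up" about 80% of the time
```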
2. Approach for implementation
There are mainly three ways to implement reinforcement learning in
ML, which are:
1. Value-based:
The value-based approach aims to find the optimal value function,
which gives the maximum value achievable at a state under any policy.
The agent therefore expects the long-term return at any state s under
policy π.
2. Policy-based:
The policy-based approach finds the optimal policy for maximum
future reward without using a value function. In this approach, the
agent tries to learn a policy such that the action performed at each
step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π)
at any state.
o Stochastic: In this policy, probability determines the
produced action.
3. Model-based: In the model-based approach, a virtual model is
created for the environment, and the agent explores that
environment to learn it. There is no particular solution or algorithm
for this approach because the model representation is different for
each environment.
3. Reinforcement Learning Working
To understand the working process of the RL, we need to consider two
main things:
1. Environment: It can be anything, such as a room, a maze, a football
ground, etc.
2. Agent: An intelligent agent such as an AI robot.
Let's take an example of a maze environment that the agent needs to
explore. Consider the below image:
FIG 3
In the above image, the agent is at the very first block of the maze. The
maze consists of an S6 block, which is a wall, an S8 block, which is a
fire pit, and an S4 block, which holds a diamond.
The agent cannot cross the S6 block, as it is a solid wall. If the agent
reaches the S4 block, it gets a +1 reward; if it reaches the fire pit,
it gets a -1 reward. The agent can take four actions: move up, move
down, move left, and move right.
The agent can take any path to reach the final point, but it needs to do
so in as few steps as possible. Suppose the agent follows the path S9-
S5-S1-S2-S3; it will then get the +1 reward.
The agent will try to remember the steps it took to reach the goal. To
memorize the steps, it assigns a value of 1 to each previous step.
Consider the below figure:
FIG 4
Now the agent has successfully stored the previous steps by assigning a
value of 1 to each visited block. But what should the agent do if it
starts from a block that has blocks of value 1 on both sides? Consider
the below diagram:
FIG 5
It is a difficult situation for the agent to decide whether it should go up
or down, as each block has the same value. So the above approach is
not suitable for the agent to reach the destination. To solve this
problem, we use the Bellman equation, which is the main concept
behind reinforcement learning.
CHAPTER 3
THE BELLMAN EQUATION
The Bellman equation was introduced by the mathematician Richard
Ernest Bellman in 1953, and hence it carries his name. It is associated
with dynamic programming and is used to calculate the value of a
decision problem at a certain point by including the values of the
states that follow.
It is the way value functions are calculated in dynamic programming,
and it leads to modern reinforcement learning.
The key elements used in the Bellman equation are:
o The action performed by the agent, referred to as "a"
o The state reached by performing the action, "s"
o The reward/feedback obtained for each good or bad action, "R"
o The discount factor, gamma "γ"
The Bellman equation can be written as:
V(s) = max_a [R(s, a) + γ V(s')]
Where,
V(s) = the value of the current state s,
R(s, a) = the reward obtained in state s for performing action a,
γ = the discount factor,
V(s') = the value of the next state s'.
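To show how this equation resolves the tie the agent faced in the maze, here is a small value-iteration sketch in Python; the 3×3 grid layout (wall, fire pit, and diamond positions) and the discount factor γ = 0.9 are assumptions chosen for illustration, not values taken from the figures above.

```python
# Value iteration: V(s) = max_a [ R(s, a) + γ * V(s') ] on a hypothetical 3x3 grid
GAMMA = 0.9
ROWS, COLS = 3, 3
WALL, FIRE, DIAMOND = (1, 1), (1, 2), (0, 2)   # assumed positions for illustration

def reward(cell):
    """Reward received when moving into a cell."""
    if cell == DIAMOND:
        return 1.0
    if cell == FIRE:
        return -1.0
    return 0.0

def neighbours(cell):
    """Cells reachable by moving up, down, left, or right (the wall is impassable)."""
    r, c = cell
    moves = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(nr, nc) for nr, nc in moves
            if 0 <= nr < ROWS and 0 <= nc < COLS and (nr, nc) != WALL]

V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
for _ in range(50):  # sweep repeatedly until the values stop changing noticeably
    for cell in V:
        if cell in (WALL, FIRE, DIAMOND):
            continue  # terminal / impassable cells keep a fixed value
        V[cell] = max(reward(nxt) + GAMMA * V[nxt] for nxt in neighbours(cell))

for r in range(ROWS):
    print(["{:5.2f}".format(V[(r, c)]) for c in range(COLS)])
```

Instead of every visited block holding the same value 1, the discounted values now decrease with distance from the diamond, so the agent can always move toward the neighbouring state with the higher value.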
CHAPTER 4
TYPES OF REINFORCEMENT LEARNING
There are mainly two types of reinforcement learning, which are:
o Positive Reinforcement
o Negative Reinforcement
Positive Reinforcement:
Positive reinforcement means adding something to increase the
tendency that the expected behavior will occur again. It impacts the
behavior of the agent positively and increases the strength of that
behavior.
This type of reinforcement can sustain changes for a long time, but
too much positive reinforcement may lead to an overload of states,
which can diminish the results.
Negative Reinforcement:
Negative reinforcement is the opposite of positive reinforcement: it
increases the tendency that a specific behavior will occur again by
avoiding or removing a negative condition.
Depending on the situation and the behavior, it can be more effective
than positive reinforcement, but it provides only enough reinforcement
to meet the minimum required behavior.
CHAPTER 5
MARKOV DECISION PROCESS
How to represent the agent state?
We can represent the agent state using the Markov state, which contains all the
required information from the history. A state St is a Markov state if it satisfies
the following condition:
P[St+1 | St] = P[St+1 | S1, ..., St]
The Markov state follows the Markov property, which says that the future is
independent of the past given the present. RL here assumes fully observable
environments, in which the agent can observe the environment and act in the new
state. The complete process is known as a Markov Decision Process.
Markov Decision Process
A Markov Decision Process, or MDP, is used to formalize reinforcement
learning problems. If the environment is completely observable, its dynamics
can be modeled as a Markov process. In an MDP, the agent constantly interacts
with the environment and performs actions; after each action, the environment
responds and generates a new state.
FIG 6
An MDP is used to describe the environment for RL, and almost all RL
problems can be formalized as an MDP.
An MDP is a tuple of four elements (S, A, Pa, Ra):
o A finite set of states S
o A finite set of actions A
o A reward function Ra: the reward received after transitioning from state S to state S' due to action a
o A transition probability Pa: the probability of reaching state S' from state S due to action a
The MDP relies on the Markov property, and to understand the MDP better, we need
to learn about it.
Markov Property
It says that "if the agent is in the current state s1, performs an action a1,
and moves to the state s2, then the state transition from s1 to s2 depends only on
the current state and action; it does not depend on past actions, rewards, or
states."
Or, in other words, as per the Markov property, the current state transition does
not depend on any past action or state. Hence, an MDP is an RL problem that
satisfies the Markov property. For example, in a chess game the players focus only
on the current board position and do not need to remember past actions or states.
Finite MDP:
A finite MDP is one in which the states, rewards, and actions are all finite. In
RL, we consider only finite MDPs.
Markov Process:
A Markov process is a memoryless process with a sequence of random states S1, S2,
..., St that satisfies the Markov property. A Markov process is also known as a
Markov chain, and it is a tuple (S, P) of a state set S and a transition function P.
These two components (S and P) define the dynamics of the system.
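A minimal Python sketch of such a Markov chain (S, P); the two weather states and the transition probabilities below are made-up numbers used only to show that the next state depends on the current state alone.

```python
import random

# S: the set of states; P[s][s']: probability of moving from state s to state s'
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

state = "sunny"
trajectory = [state]
for _ in range(10):
    next_states = list(P[state])
    probs = [P[state][s] for s in next_states]
    state = random.choices(next_states, weights=probs)[0]  # depends only on the current state
    trajectory.append(state)

print(" -> ".join(trajectory))
```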
CHAPTER 6
VARIOUS REINFORCEMENT LEARNING ALGORITHMS
Reinforcement learning algorithms are mainly used in AI and gaming
applications. The most commonly used algorithms are:
1. Q-Learning:
a. Q-learning is an off-policy RL algorithm used for temporal-
difference learning. Temporal-difference methods are a way of
comparing temporally successive predictions.
b. It learns the value function Q(s, a), which tells how good it is to
take action "a" in a particular state "s".
c. The below flowchart explains the working of Q-learning (a
minimal update sketch follows the flowchart):
FIG 7
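As a complement to the flowchart, here is a small tabular Q-learning update in Python; the learning rate, discount factor, and the example states and actions are illustrative assumptions.

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.9   # discount factor (assumed value)

Q = defaultdict(float)  # Q[(state, action)] -> estimated value, defaults to 0

def q_learning_update(state, action, reward, next_state, actions):
    """Off-policy TD update: Q(s,a) += α [ r + γ max_a' Q(s',a') - Q(s,a) ]."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (td_target - Q[(state, action)])

# Example: one transition with hypothetical states and actions
actions = ["left", "right"]
q_learning_update(state="s1", action="right", reward=1.0, next_state="s2", actions=actions)
print(Q[("s1", "right")])  # 0.1 after a single update
```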
2. State Action Reward State Action (SARSA)
o SARSA stands for State Action Reward State Action; it is an
on-policy temporal-difference learning method. The on-policy
control method selects the action for each state while learning,
using a specific policy.
o The goal of SARSA is to calculate Qπ(s, a) for the currently
selected policy π and all pairs (s, a).
o The main difference between the Q-learning and SARSA algorithms
is that, unlike Q-learning, SARSA does not use the maximum
Q-value of the next state when updating the Q-table.
o In SARSA, the new action and reward are selected using the same
policy that determined the original action.
o SARSA is so named because it uses the quintuple (s, a, r, s', a'),
where:
s: original state
a: original action
r: reward observed after taking action a in state s
s' and a': new state-action pair.
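For contrast with Q-learning, here is a sketch of the on-policy SARSA update over that quintuple; the next action a' is the one the policy actually chose in s', not the greedy maximum. The states, actions, and hyper-parameters are hypothetical.

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.9   # discount factor (assumed value)
Q = defaultdict(float)

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy TD update: Q(s,a) += α [ r + γ Q(s',a') - Q(s,a) ].
    a_next is the action actually selected by the policy in s', not max over actions."""
    td_target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])

# One step over the quintuple (s, a, r, s', a'):
sarsa_update("s1", "right", 1.0, "s2", "left")
print(Q[("s1", "right")])  # 0.1 after a single update
```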
3. Deep Q Neural Network (DQN):
• As the name suggests, DQN is Q-learning using neural
networks.
• For an environment with a big state space, defining and
updating a Q-table becomes a challenging and complex task.
• To solve this issue, we can use the DQN algorithm, where,
instead of defining a Q-table, a neural network approximates the
Q-values for each action and state.
• Two other techniques are also essential for training a DQN:
1. Experience Replay
2. Separate Target Network
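The sketch below shows one way these two ingredients typically fit together, using PyTorch; the network architecture, state dimension, and hyper-parameters are illustrative assumptions, not values prescribed by any particular DQN implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99   # assumed sizes for illustration

def make_net():
    # A small fully connected network mapping a state to one Q-value per action
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

q_net = make_net()                    # online network, approximates Q(s, a)
target_net = make_net()               # separate target network
target_net.load_state_dict(q_net.state_dict())

replay_buffer = deque(maxlen=10_000)  # experience replay memory of (s, a, r, s', done) tuples
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(batch_size=32):
    """One gradient step on a random minibatch sampled from the replay buffer."""
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    q_sa = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)  # Q(s, a) for taken actions
    with torch.no_grad():                                                # targets come from the frozen network
        target = r.float() + GAMMA * (1 - done.float()) * target_net(s2.float()).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Transitions are stored as replay_buffer.append((state, action, reward, next_state, done)),
# and every few hundred steps the target network is refreshed:
# target_net.load_state_dict(q_net.state_dict())
```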
4. Deep Deterministic Policy Gradient (DDPG)
DDPG relies on the actor-critic architecture with two eponymous
elements, the actor and the critic. The actor tunes the parameter θ of
the policy function, i.e., it decides the best action for a specific state.
The critic evaluates the policy function estimated by the actor
according to the temporal-difference (TD) error.
The critic's value estimate is updated from the TD error,
δ = r + γV(s') − V(s), where V denotes the value of the policy the actor
has decided on; this update has the same shape as the Q-learning update.
TD learning is a way to learn to predict a value from future values of a
given state, and Q-learning is a specific type of TD learning for
learning the Q-value.
FIG 8
DDPG also borrows the ideas of experience replay and a separate target
network from DQN. Another issue with DDPG is that it seldom performs
exploration over actions on its own. A solution is to add noise to the
parameter space or the action space.
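In practice this fix is often just additive noise on the deterministic action; a minimal sketch, assuming a one-dimensional continuous action clipped to [-1, 1] and Gaussian noise (Ornstein-Uhlenbeck noise is the other common choice).

```python
import random

def noisy_action(actor_action, noise_std=0.1, low=-1.0, high=1.0):
    """Add exploration noise to the deterministic action chosen by the actor,
    then clip it back into the valid action range (assumed here to be [-1, 1])."""
    noisy = actor_action + random.gauss(0.0, noise_std)
    return max(low, min(high, noisy))

print(noisy_action(0.5))  # e.g. 0.46 or 0.58 -- varies from run to run
```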
Classification of RL Algorithms
FIG 9
CHAPTER 7
COMPARISON OF REINFORCEMENT LEARNING ALGORITHMS
1. Associative reinforcement learning
Associative reinforcement learning tasks combine facets of stochastic
learning automata tasks and supervised learning pattern classification
tasks. In associative reinforcement learning tasks, the learning system
interacts in a closed loop with its environment.
2. Deep reinforcement learning
This approach extends reinforcement learning by using a deep neural
network and without explicitly designing the state space. The work on
learning ATARI games by Google DeepMind increased attention
to deep reinforcement learning or end-to-end reinforcement learning.
3. Inverse reinforcement learning
In inverse reinforcement learning (IRL), no reward function is given.
Instead, the reward function is inferred given an observed behavior from
an expert. The idea is to mimic observed behavior, which is often
optimal or close to optimal.
4. Safe Reinforcement Learning
Safe Reinforcement Learning (SRL) can be defined as the process of
learning policies that maximize the expectation of the return in problems
in which it is important to ensure reasonable system performance and/or
respect safety constraints during the learning and/or deployment
processes.
FIG 10
CHAPTER 8
LEARNING METHODS
These methods differ from the previously studied methods and are also
used comparatively rarely. In this kind of learning, there is an agent
that we want to train over a period of time so that it can interact with
a specific environment. The agent follows a set of strategies for
interacting with the environment, and after observing the environment
it takes actions with regard to the environment's current state. The
following are the main steps of reinforcement learning methods:
1. First, prepare an agent with an initial set of strategies.
2. Then observe the environment and its current state.
3. Next, select the optimal policy for the current state of the
environment and perform the corresponding action.
4. The agent then receives a reward or penalty in accordance with the
action taken in the previous step.
5. Update the strategies if required.
6. Finally, repeat steps 2-5 until the agent has learned and adopted the
optimal policy.
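The six steps above correspond to a short, generic training-loop skeleton; env, policy, and update_policy are placeholders for whatever task and algorithm are being used, and the method names assumed here are not from any specific library.

```python
def train(env, policy, update_policy, episodes=100):
    """Generic RL loop following steps 1-6 above.
    `env` is assumed to expose reset() and step(action) -> (next_state, reward, done);
    `policy(state)` returns an action, and `update_policy` adjusts the strategy (step 5)."""
    for _ in range(episodes):
        state = env.reset()                                   # step 2: observe the current state
        done = False
        while not done:
            action = policy(state)                            # step 3: act according to the policy
            next_state, reward, done = env.step(action)       # step 4: receive reward or penalty
            update_policy(state, action, reward, next_state)  # step 5: update the strategy
            state = next_state                                 # step 6: repeat steps 2-5
    return policy
```

Any of the update rules sketched earlier (Q-learning, SARSA) can be plugged in as update_policy.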
Difference between Reinforcement learning and Supervised
learning:
Reinforcement learning:
• Reinforcement learning is all about making decisions sequentially. In
simple words, the output depends on the state of the current input, and
the next input depends on the output of the previous input.
• Decisions are dependent, so we give labels to sequences of dependent
decisions.
• Example: a chess game.
Supervised learning:
• In supervised learning, the decision is made on the initial input, or
the input given at the start.
• Decisions are independent of each other, so labels are given to each
decision.
• Example: object recognition.
TABLE 1
TABLE 2
CHAPTER 9
APPLICATIONS
Reinforcement Learning Applications
FIG 11
1. Robotics:
1. RL is used in Robot navigation, Robo-soccer, walking,
juggling, etc.
2. Control:
1. RL can be used for adaptive control, such as factory processes,
admission control in telecommunication, and helicopter piloting.
3. Game Playing:
1. RL can be used in Game playing such as tic-tac-toe, chess, etc.
4. Chemistry:
1. RL can be used for optimizing the chemical reactions.
5. Business:
1. RL is now used for business strategy planning.
6. Manufacturing:
1. In various automobile manufacturing companies, the robots
use deep reinforcement learning to pick goods and put them
in some containers.
7. Finance Sector:
1. The RL is currently used in the finance sector for evaluating
trading strategies.
CHAPTER 10
PROS AND CONS
Pros of Reinforcement Learning
• Reinforcement learning can be used to solve very complex problems
that cannot be solved by conventional techniques.
• This technique is preferred to achieve long-term results, which are
very difficult to achieve.
• This learning model is very similar to the way human beings learn,
which is one reason it can reach very strong performance on suitable
tasks.
• The model can correct the errors that occurred during the training
process.
• Once the model corrects an error, the chance of repeating the same
error is very low.
• It can build a model well suited to solving a particular problem.
• Robots can implement reinforcement learning algorithms to learn
how to walk.
• In the absence of a training dataset, it is bound to learn from its
experience.
• Reinforcement learning models can outperform humans in many
tasks. DeepMind’s AlphaGo program, a reinforcement learning
model, beat the world champion Lee Sedol at the game of Go in
March 2016.
• Reinforcement learning is intended to achieve the ideal behavior of
a model within a specific context, to maximize its performance.
• It can be useful when the only way to collect information about the
environment is to interact with it.
• Reinforcement learning algorithms maintain a balance between
exploration and exploitation. Exploration is the process of trying
different things to see if they are better than what has been tried
before. Exploitation is the process of trying the things that have
worked best in the past. Other learning algorithms do not maintain
this balance.
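Epsilon-greedy action selection is the simplest common way to keep this balance: with probability ε the agent explores a random action, otherwise it exploits the best-known one. A minimal sketch with hypothetical Q-value estimates follows.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))        # exploration: try something new
    return max(q_values, key=q_values.get)          # exploitation: best action so far

q_values = {"left": 0.2, "right": 0.7}              # hypothetical estimates
print(epsilon_greedy(q_values))                     # "right" about 95% of the time for epsilon=0.1
```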
Cons of Reinforcement Learning
• Reinforcement learning as a framework rests on simplifying
assumptions that are wrong in many ways, but it is precisely this
simplification that makes it useful.
• Too much reinforcement learning can lead to an overload of states,
which can diminish the results.
• Reinforcement learning is not preferable to use for solving simple
problems.
• Reinforcement learning needs a lot of data and a lot of computation. It
is data-hungry. That is why it works really well in video games
because one can play the game again and again and again, so getting
lots of data seems feasible.
• Reinforcement learning assumes the world is Markovian, which it is
not. The Markovian model describes a sequence of possible events in
which the probability of each event depends only on the state attained
in the previous event.
• The curse of dimensionality limits reinforcement learning heavily for
real physical systems. According to Wikipedia, the curse of
dimensionality refers to various phenomena that arise when analyzing
and organizing data in high-dimensional spaces that do not occur in
low-dimensional settings such as the three-dimensional physical space
of everyday experience.
• Another disadvantage is the curse of real-world samples. For example,
consider the case of learning with robots. The robot hardware is usually
very expensive, suffers from wear and tear, and requires careful
maintenance. Repairing a robot system costs a lot.
• To address many of the problems of reinforcement learning, we can
combine it with other techniques rather than abandoning it altogether.
One popular combination is reinforcement learning with deep learning.
CHAPTER 11
CONCLUSION
We can say that Reinforcement Learning is one of the most interesting
and useful parts of machine learning. In RL, the agent learns about the
environment by exploring it without any human intervention. It is a
core learning paradigm in Artificial Intelligence. But there are cases
where it should not be used: if you have enough data to solve the
problem, other ML algorithms can be applied more efficiently. The
main issue with RL algorithms is that some factors, such as delayed
feedback, may affect the speed of learning.
REFERENCES:
1. https://www.javatpoint.com/reinforcement-learning
2. https://towardsdatascience.com/introduction-to-various-reinforcement-learning-algorithms-i-q-learning-sarsa-dqn-ddpg-72a5e0cb6287
3. https://en.wikipedia.org/wiki/Reinforcement_learning#Comparison_of_reinforcement_learning_algorithms
More Related Content

What's hot

Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDing Li
 
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex FridmanMIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex FridmanPeerasak C.
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learningbutest
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-LearningKuppusamy P
 
Knowledge representation In Artificial Intelligence
Knowledge representation In Artificial IntelligenceKnowledge representation In Artificial Intelligence
Knowledge representation In Artificial IntelligenceRamla Sheikh
 
Semi-Supervised Learning
Semi-Supervised LearningSemi-Supervised Learning
Semi-Supervised LearningLukas Tencer
 
Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders Akash Goel
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningshivani saluja
 
Machine Learning Algorithm - Logistic Regression
Machine Learning Algorithm - Logistic RegressionMachine Learning Algorithm - Logistic Regression
Machine Learning Algorithm - Logistic RegressionKush Kulshrestha
 
Artificial Intelligence Searching Techniques
Artificial Intelligence Searching TechniquesArtificial Intelligence Searching Techniques
Artificial Intelligence Searching TechniquesDr. C.V. Suresh Babu
 
Problem solving agents
Problem solving agentsProblem solving agents
Problem solving agentsMegha Sharma
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learningJie-Han Chen
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningKhaled Saleh
 
Adversarial search
Adversarial searchAdversarial search
Adversarial searchDheerendra k
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement LearningUsman Qayyum
 
Minmax Algorithm In Artificial Intelligence slides
Minmax Algorithm In Artificial Intelligence slidesMinmax Algorithm In Artificial Intelligence slides
Minmax Algorithm In Artificial Intelligence slidesSamiaAziz4
 

What's hot (20)

Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
search strategies in artificial intelligence
search strategies in artificial intelligencesearch strategies in artificial intelligence
search strategies in artificial intelligence
 
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex FridmanMIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-Learning
 
Knowledge representation In Artificial Intelligence
Knowledge representation In Artificial IntelligenceKnowledge representation In Artificial Intelligence
Knowledge representation In Artificial Intelligence
 
Semi-Supervised Learning
Semi-Supervised LearningSemi-Supervised Learning
Semi-Supervised Learning
 
Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Canny Edge Detection
Canny Edge DetectionCanny Edge Detection
Canny Edge Detection
 
Machine Learning Algorithm - Logistic Regression
Machine Learning Algorithm - Logistic RegressionMachine Learning Algorithm - Logistic Regression
Machine Learning Algorithm - Logistic Regression
 
Artificial Intelligence Searching Techniques
Artificial Intelligence Searching TechniquesArtificial Intelligence Searching Techniques
Artificial Intelligence Searching Techniques
 
Problem solving agents
Problem solving agentsProblem solving agents
Problem solving agents
 
Evolutionary Computing
Evolutionary ComputingEvolutionary Computing
Evolutionary Computing
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learning
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
 
Adversarial search
Adversarial searchAdversarial search
Adversarial search
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
Minmax Algorithm In Artificial Intelligence slides
Minmax Algorithm In Artificial Intelligence slidesMinmax Algorithm In Artificial Intelligence slides
Minmax Algorithm In Artificial Intelligence slides
 
Practical Swarm Optimization (PSO)
Practical Swarm Optimization (PSO)Practical Swarm Optimization (PSO)
Practical Swarm Optimization (PSO)
 

Similar to Reinforcement learning

An efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningAn efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningPrabhu Kumar
 
reinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfreinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfVaishnavGhadge1
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningSVijaylakshmi
 
leewayhertz.com-Reinforcement Learning from Human Feedback RLHF.pdf
leewayhertz.com-Reinforcement Learning from Human Feedback RLHF.pdfleewayhertz.com-Reinforcement Learning from Human Feedback RLHF.pdf
leewayhertz.com-Reinforcement Learning from Human Feedback RLHF.pdfKristiLBurns
 
What is Reinforcement Learning.pdf
What is Reinforcement Learning.pdfWhat is Reinforcement Learning.pdf
What is Reinforcement Learning.pdfAiblogtech
 
Hibridization of Reinforcement Learning Agents
Hibridization of Reinforcement Learning AgentsHibridization of Reinforcement Learning Agents
Hibridization of Reinforcement Learning Agentsbutest
 
Head First Reinforcement Learning
Head First Reinforcement LearningHead First Reinforcement Learning
Head First Reinforcement Learningazzeddine chenine
 
A learning agent.doc
A learning agent.docA learning agent.doc
A learning agent.docjdinfo444
 
REINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptxREINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptxarchayacb21
 
Reinforcement Learning Guide For Beginners
Reinforcement Learning Guide For BeginnersReinforcement Learning Guide For Beginners
Reinforcement Learning Guide For Beginnersgokulprasath06
 
Lecture 04 intelligent agents
Lecture 04 intelligent agentsLecture 04 intelligent agents
Lecture 04 intelligent agentsHema Kashyap
 
Reinforcement Learning.pdf
Reinforcement Learning.pdfReinforcement Learning.pdf
Reinforcement Learning.pdfhemayadav41
 
reinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptxreinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptxMohibKhan79
 
A Review on Introduction to Reinforcement Learning
A Review on Introduction to Reinforcement LearningA Review on Introduction to Reinforcement Learning
A Review on Introduction to Reinforcement Learningijtsrd
 
acai01-updated.ppt
acai01-updated.pptacai01-updated.ppt
acai01-updated.pptbutest
 
poster_final_v7
poster_final_v7poster_final_v7
poster_final_v7Tie Zheng
 
Andrii Prysiazhnyk: Why the amazon sellers are buiyng the RTX 3080: Dynamic p...
Andrii Prysiazhnyk: Why the amazon sellers are buiyng the RTX 3080: Dynamic p...Andrii Prysiazhnyk: Why the amazon sellers are buiyng the RTX 3080: Dynamic p...
Andrii Prysiazhnyk: Why the amazon sellers are buiyng the RTX 3080: Dynamic p...Lviv Startup Club
 
Agents-and-Problem-Solving-20022024-094442am.pdf
Agents-and-Problem-Solving-20022024-094442am.pdfAgents-and-Problem-Solving-20022024-094442am.pdf
Agents-and-Problem-Solving-20022024-094442am.pdfsyedhasanali293
 
RL_Dr.SNR Final ppt for Presentation 28.05.2021.pptx
RL_Dr.SNR Final ppt for Presentation 28.05.2021.pptxRL_Dr.SNR Final ppt for Presentation 28.05.2021.pptx
RL_Dr.SNR Final ppt for Presentation 28.05.2021.pptxdeeplearning6
 

Similar to Reinforcement learning (20)

An efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningAn efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game Learning
 
reinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfreinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdf
 
CS3013 -MACHINE LEARNING.pptx
CS3013 -MACHINE LEARNING.pptxCS3013 -MACHINE LEARNING.pptx
CS3013 -MACHINE LEARNING.pptx
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
leewayhertz.com-Reinforcement Learning from Human Feedback RLHF.pdf
leewayhertz.com-Reinforcement Learning from Human Feedback RLHF.pdfleewayhertz.com-Reinforcement Learning from Human Feedback RLHF.pdf
leewayhertz.com-Reinforcement Learning from Human Feedback RLHF.pdf
 
What is Reinforcement Learning.pdf
What is Reinforcement Learning.pdfWhat is Reinforcement Learning.pdf
What is Reinforcement Learning.pdf
 
Hibridization of Reinforcement Learning Agents
Hibridization of Reinforcement Learning AgentsHibridization of Reinforcement Learning Agents
Hibridization of Reinforcement Learning Agents
 
Head First Reinforcement Learning
Head First Reinforcement LearningHead First Reinforcement Learning
Head First Reinforcement Learning
 
A learning agent.doc
A learning agent.docA learning agent.doc
A learning agent.doc
 
REINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptxREINFORCEMENT LEARNING (reinforced through trial and error).pptx
REINFORCEMENT LEARNING (reinforced through trial and error).pptx
 
Reinforcement Learning Guide For Beginners
Reinforcement Learning Guide For BeginnersReinforcement Learning Guide For Beginners
Reinforcement Learning Guide For Beginners
 
Lecture 04 intelligent agents
Lecture 04 intelligent agentsLecture 04 intelligent agents
Lecture 04 intelligent agents
 
Reinforcement Learning.pdf
Reinforcement Learning.pdfReinforcement Learning.pdf
Reinforcement Learning.pdf
 
reinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptxreinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptx
 
A Review on Introduction to Reinforcement Learning
A Review on Introduction to Reinforcement LearningA Review on Introduction to Reinforcement Learning
A Review on Introduction to Reinforcement Learning
 
acai01-updated.ppt
acai01-updated.pptacai01-updated.ppt
acai01-updated.ppt
 
poster_final_v7
poster_final_v7poster_final_v7
poster_final_v7
 
Andrii Prysiazhnyk: Why the amazon sellers are buiyng the RTX 3080: Dynamic p...
Andrii Prysiazhnyk: Why the amazon sellers are buiyng the RTX 3080: Dynamic p...Andrii Prysiazhnyk: Why the amazon sellers are buiyng the RTX 3080: Dynamic p...
Andrii Prysiazhnyk: Why the amazon sellers are buiyng the RTX 3080: Dynamic p...
 
Agents-and-Problem-Solving-20022024-094442am.pdf
Agents-and-Problem-Solving-20022024-094442am.pdfAgents-and-Problem-Solving-20022024-094442am.pdf
Agents-and-Problem-Solving-20022024-094442am.pdf
 
RL_Dr.SNR Final ppt for Presentation 28.05.2021.pptx
RL_Dr.SNR Final ppt for Presentation 28.05.2021.pptxRL_Dr.SNR Final ppt for Presentation 28.05.2021.pptx
RL_Dr.SNR Final ppt for Presentation 28.05.2021.pptx
 

More from SKS

Cloud computing in iot seminar report
Cloud computing in iot seminar reportCloud computing in iot seminar report
Cloud computing in iot seminar reportSKS
 
Uses of ethical theories in professional ethics
Uses of ethical theories in professional ethicsUses of ethical theories in professional ethics
Uses of ethical theories in professional ethicsSKS
 
Deep learning seminar report
Deep learning seminar reportDeep learning seminar report
Deep learning seminar reportSKS
 
Network virtualization seminar report
Network virtualization seminar reportNetwork virtualization seminar report
Network virtualization seminar reportSKS
 
Security in IoT
Security in IoTSecurity in IoT
Security in IoTSKS
 
Variety of moral issues
Variety of moral issuesVariety of moral issues
Variety of moral issuesSKS
 
Research ethics
Research ethicsResearch ethics
Research ethicsSKS
 
Industrial standards
Industrial standardsIndustrial standards
Industrial standardsSKS
 
Engineers are responsible experimenters
Engineers are responsible experimentersEngineers are responsible experimenters
Engineers are responsible experimentersSKS
 
Engineering as experimentation
Engineering as experimentationEngineering as experimentation
Engineering as experimentationSKS
 
Controversy and consensus
Controversy and consensusControversy and consensus
Controversy and consensusSKS
 
Codes of ethics
Codes of ethicsCodes of ethics
Codes of ethicsSKS
 
Codes of ethics
Codes of ethicsCodes of ethics
Codes of ethicsSKS
 
A balanced outlook on the law
A balanced outlook on  the lawA balanced outlook on  the law
A balanced outlook on the lawSKS
 
Theories about the right decision
Theories about the right decision Theories about the right decision
Theories about the right decision SKS
 
Safety and risk
Safety and riskSafety and risk
Safety and riskSKS
 
Risk-benefit analysis
Risk-benefit analysisRisk-benefit analysis
Risk-benefit analysisSKS
 
Reducing risk
Reducing riskReducing risk
Reducing riskSKS
 
Chernobyl case study
Chernobyl case studyChernobyl case study
Chernobyl case studySKS
 
Assessment of safety and risk
Assessment of safety and riskAssessment of safety and risk
Assessment of safety and riskSKS
 

More from SKS (20)

Cloud computing in iot seminar report
Cloud computing in iot seminar reportCloud computing in iot seminar report
Cloud computing in iot seminar report
 
Uses of ethical theories in professional ethics
Uses of ethical theories in professional ethicsUses of ethical theories in professional ethics
Uses of ethical theories in professional ethics
 
Deep learning seminar report
Deep learning seminar reportDeep learning seminar report
Deep learning seminar report
 
Network virtualization seminar report
Network virtualization seminar reportNetwork virtualization seminar report
Network virtualization seminar report
 
Security in IoT
Security in IoTSecurity in IoT
Security in IoT
 
Variety of moral issues
Variety of moral issuesVariety of moral issues
Variety of moral issues
 
Research ethics
Research ethicsResearch ethics
Research ethics
 
Industrial standards
Industrial standardsIndustrial standards
Industrial standards
 
Engineers are responsible experimenters
Engineers are responsible experimentersEngineers are responsible experimenters
Engineers are responsible experimenters
 
Engineering as experimentation
Engineering as experimentationEngineering as experimentation
Engineering as experimentation
 
Controversy and consensus
Controversy and consensusControversy and consensus
Controversy and consensus
 
Codes of ethics
Codes of ethicsCodes of ethics
Codes of ethics
 
Codes of ethics
Codes of ethicsCodes of ethics
Codes of ethics
 
A balanced outlook on the law
A balanced outlook on  the lawA balanced outlook on  the law
A balanced outlook on the law
 
Theories about the right decision
Theories about the right decision Theories about the right decision
Theories about the right decision
 
Safety and risk
Safety and riskSafety and risk
Safety and risk
 
Risk-benefit analysis
Risk-benefit analysisRisk-benefit analysis
Risk-benefit analysis
 
Reducing risk
Reducing riskReducing risk
Reducing risk
 
Chernobyl case study
Chernobyl case studyChernobyl case study
Chernobyl case study
 
Assessment of safety and risk
Assessment of safety and riskAssessment of safety and risk
Assessment of safety and risk
 

Recently uploaded

Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitolTechU
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 

Recently uploaded (20)

Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptx
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 

Reinforcement learning

  • 1. REINFORCEMENT LEARNING (RL) SEMINAR REPORT Submitted by KALAISRIRAM.S Reg.No:17TH1222 In partial fulfillment of the requirement for the degree of BACHELOR OF TECHNOLOGY in INFORMATION TECHNOLOGY of PONDICHERRY UNIVERSITY DEPARTMENT OF INFORMATION TECHNOLOGY MANAKULA VINAYAGAR INSTITUTE OF TECHNOLOGY KALITHEERTHALKUPPAM, PUDUCHERRY- 605 107. MARCH-21.
  • 2. MANAKULA VINAYAGAR INSTITUTE OF TECHNOLOGY KALITHEERTHAL KUPPAM, PUDUCHERRY- 605 107 DEPARTMENT OF INFORMATION TECHNOLOGY BONAFIDE CERTIFICATE This is to Certify that the Seminar Report Presentation Titled “REENFORCEMENT LEARNING” is a bonafide record work done by S.KALAISRIRAM (Reg. No:17TH1222) of Seventh Semester B.TECH in INFORMATION TECHNOLOGY for the SEMINAR REPORT during the academic year 2020-21. Staff in charge Head of the Department
  • 3. Table of Contents INTRODUCTION.......................................................................................................................................1 PROCESS OF REINFORCEMENT LEARNING ..................................................................................4 THE BELLMAN EQUATION ..................................................................................................................9 TYPES OF REINFORCEMENT LEARNING......................................................................................10 MARKOV DECISION PROCESS.............................................................................................................11 How to represent the agent state? .......................................................................................................11 1. Markov Decision.............................................................................................................................11 2. Markov Property.........................................................................................................................12 3. Markov Process:..........................................................................................................................12 VARIOUS REINFORCEMENT LEARNING ALGORITHMS..........................................................13 COMPARISON OF REINFORCEMENT LEARNING ALGORITHMS ................................................18 1. Associative reinforcement learning ............................................................................................18 2. Deep reinforcement learning.......................................................................................................18 3. Inverse reinforcement learning ...................................................................................................18 4. Safe Reinforcement Learning .....................................................................................................18 LEARNING MEATHODS.......................................................................................................................20 Difference between Reinforcement learning and Supervised learning:...........................................21 APPLICATIONS ......................................................................................................................................23 Reinforcement Learning Applications....................................................................................................23 PROS AND CONS....................................................................................................................................25 Pros of Reinforcement Learning.............................................................................................................25 Cons of Reinforcement Learning............................................................................................................26 CONCLUSION .........................................................................................................................................28 REFERENCES: ....................................................................................................................................28
  • 4. 1 CHAPTER 1 INTRODUCTION 1. What is Reinforcement learning? • Reinforcement Learning is a feedback-based Machine learning technique in which an agent learns to behave in an environment by performing the actions and seeing the results of actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or penalty. • In Reinforcement Learning, the agent learns automatically using feedbacks without any labeled data, unlike supervised learning. Since there is no labeled data, so the agent is bound to learn by its experience only. • RL solves a specific type of problem where decision making is sequential, and the goal is long-term, such as game-playing, robotics, etc.The agent interacts with the environment and explores it by itself. • The primary goal of an agent in reinforcement learning is to improve the performance by getting the maximum positive rewards. • The agent learns with the process of hit and trial, and based on the experience, it learns to perform the task in a better way. • Hence, we can say that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within that." .
• 6. 2. Terms used in Reinforcement Learning
• Agent: An entity that can perceive/explore the environment and act upon it.
• Environment: The situation in which the agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
• Action: Actions are the moves taken by an agent within the environment.
• State: The situation returned by the environment after each action taken by the agent.
• Reward: Feedback returned to the agent from the environment to evaluate the agent's action.
• Policy: The strategy applied by the agent to decide the next action based on the current state.
• Value: The expected long-term return with discounting, as opposed to the short-term reward.
• Q-value: Mostly similar to value, but it takes one additional parameter, the current action (a). (A small code sketch relating these terms is given after this slide.)
3. Key Features of Reinforcement Learning
o In RL, the agent is not instructed about the environment or which actions need to be taken.
o It is based on trial and error.
o The agent takes the next action and changes states according to the feedback from the previous action.
o The agent may get a delayed reward.
o The environment is stochastic, and the agent needs to explore it in order to collect the maximum positive reward.
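A minimal sketch of how these terms map onto simple data structures, on a hypothetical two-state example (the states, actions, and numbers below are illustrative only, not part of the original report):

```python
# Hypothetical toy example relating Q-value, value, and policy.

# Q-value: expected long-term return for taking action a in state s.
Q = {
    ("s1", "left"): 0.2, ("s1", "right"): 0.8,
    ("s2", "left"): 0.5, ("s2", "right"): 0.1,
}

def value(state):
    """Value of a state: the best Q-value achievable from it."""
    return max(q for (s, a), q in Q.items() if s == state)

def policy(state):
    """Greedy policy: the action with the highest Q-value in this state."""
    return max(
        (a for (s, a) in Q if s == state),
        key=lambda a: Q[(state, a)],
    )

print(value("s1"))   # 0.8
print(policy("s1"))  # "right"
```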
• 7. CHAPTER 2 PROCESS OF REINFORCEMENT LEARNING
1. Elements of Reinforcement Learning
There are four main elements of Reinforcement Learning, which are given below:
1) Policy: A policy defines the way an agent behaves at a given time. It maps the perceived states of the environment to the actions to be taken in those states. The policy is the core element of RL, as it alone can define the behavior of the agent. In some cases it may be a simple function or a lookup table, whereas in other cases it may involve general computation such as a search process. A policy can be deterministic or stochastic (a small code sketch of both forms is given after this slide):
For a deterministic policy: a = π(s)
For a stochastic policy: π(a | s) = P[At = a | St = s]
2) Reward Signal: The goal of reinforcement learning is defined by the reward signal. At each step, the environment sends an immediate signal to the learning agent, and this signal is known as the reward signal. Rewards are given according to the good and bad actions taken by the agent. The agent's main objective is to maximize the total reward it receives for good actions. The reward signal can change the policy: if an action selected by the agent leads to a low reward, the policy may change to select other actions in the future.
3) Value Function: The value function gives information about how good a situation or action is and how much reward an agent can expect. A reward indicates the immediate signal for each good and bad action, whereas a value function specifies what is good in the long run. The value function depends on the reward because, without reward, there could be no value. The goal of estimating values is to achieve more reward.
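A minimal sketch contrasting the two policy forms above. The states "s1"/"s2", the actions, and the probabilities are hypothetical placeholders:

```python
import random

# Deterministic policy: a = π(s), a fixed action for every state (lookup table).
deterministic_policy = {"s1": "up", "s2": "right"}

# Stochastic policy: π(a|s) = P[At = a | St = s], a probability over actions per state.
stochastic_policy = {
    "s1": {"up": 0.7, "right": 0.3},
    "s2": {"up": 0.1, "right": 0.9},
}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state):
    probs = stochastic_policy[state]
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights, k=1)[0]

print(act_deterministic("s1"))  # always "up"
print(act_stochastic("s1"))     # "up" about 70% of the time, "right" about 30%
```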
• 8. 4) Model: The last element of reinforcement learning is the model, which mimics the behavior of the environment. With the help of the model, one can make inferences about how the environment will behave. For example, if a state and an action are given, the model can predict the next state and reward. (A small sketch of such a model is given at the end of this slide.)
2. Approach for implementation
There are mainly three ways to implement reinforcement learning in ML:
1. Value-based: The value-based approach tries to find the optimal value function, i.e. the maximum value attainable at a state under any policy. The agent then expects the long-term return at any state(s) under policy π.
2. Policy-based: The policy-based approach tries to find the optimal policy for the maximum future reward without using a value function. In this approach, the agent tries to apply a policy such that the action performed at each step helps to maximize the future reward. The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any given state.
o Stochastic: In this policy, a probability distribution determines the action produced.
3. Model-based: In the model-based approach, a virtual model of the environment is created, and the agent explores that environment to learn it. There is no single solution or algorithm for this approach because the model representation is different for each environment.
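A minimal sketch of the "model" element: a tabular model learned from experience that predicts the next state and reward for a given (state, action) pair. The recorded transitions below are hypothetical:

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {(s', r): count}

def record(state, action, next_state, reward):
    """Store one observed transition in the model."""
    counts[(state, action)][(next_state, reward)] += 1

def predict(state, action):
    """Return the most frequently observed (next_state, reward) for (state, action)."""
    observed = counts[(state, action)]
    return max(observed, key=observed.get) if observed else None

record("s1", "right", "s2", 0.0)
record("s1", "right", "s2", 0.0)
record("s1", "right", "s3", 1.0)
print(predict("s1", "right"))  # ('s2', 0.0), the most commonly observed outcome
```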
• 9. 3. Reinforcement Learning Working
To understand the working process of RL, we need to consider two main things:
1. Environment: It can be anything such as a room, a maze, a football ground, etc.
2. Agent: An intelligent agent such as an AI robot.
Let's take the example of a maze environment that the agent needs to explore. Consider the below image:
FIG 3
In the above image, the agent is at the very first block of the maze. The maze consists of an S6 block, which is a wall, S8, which is a fire pit, and S4, which is a diamond block. The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit,
• 10. it gets a -1 reward. The agent can take four actions: move up, move down, move left, and move right. The agent can take any path to reach the final point, but it needs to do so in as few steps as possible. Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1 reward. The agent will try to remember the preceding steps it has taken to reach the final step. To memorize the steps, it assigns a value of 1 to each previous step. Consider the below step:
FIG 4
Now, the agent has successfully stored the previous steps by assigning the value 1 to each previous block. But what will the agent do if it starts moving from a block which has a 1-valued block on both sides? Consider the below diagram:
• 11. FIG 5
It will be a difficult situation for the agent to decide whether it should go up or down, as each block has the same value. So, the above approach is not suitable for the agent to reach the destination. Hence, to solve the problem, we will use the Bellman equation, which is the main concept behind reinforcement learning.
• 12. CHAPTER 3 THE BELLMAN EQUATION
The Bellman equation was introduced by the mathematician Richard Ernest Bellman in 1953, and hence it is called the Bellman equation. It is associated with dynamic programming and is used to calculate the value of a decision problem at a certain point in terms of the values of the states that follow it. It is a way of calculating value functions in dynamic programming, and it leads directly to modern reinforcement learning.
The key elements used in the Bellman equation are:
o The action performed by the agent, referred to as "a"
o The state reached by performing the action, "s"
o The reward/feedback obtained for each good and bad action, "R"
o The discount factor, gamma "γ"
The Bellman equation can be written as:
V(s) = max_a [R(s, a) + γV(s')]
Where,
V(s) = the value calculated at a particular state
R(s, a) = the reward obtained at state s by performing action a
γ = the discount factor
V(s') = the value of the next state
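A minimal value-iteration sketch of the Bellman equation V(s) = max_a [R(s, a) + γV(s')]. The tiny deterministic chain of states below is a hypothetical stand-in for the maze in the figures, with a single +1 reward at the goal:

```python
GAMMA = 0.9

# transitions[s][a] = (next_state, reward); "goal" is terminal.
transitions = {
    "s1": {"right": ("s2", 0.0)},
    "s2": {"left": ("s1", 0.0), "right": ("s3", 0.0)},
    "s3": {"left": ("s2", 0.0), "right": ("goal", 1.0)},
    "goal": {},
}

V = {s: 0.0 for s in transitions}

for _ in range(100):                       # sweep until the values stop changing
    for s, actions in transitions.items():
        if actions:
            V[s] = max(r + GAMMA * V[s2] for (s2, r) in actions.values())

print(V)  # values decay by γ with distance from the goal: s3 ≈ 1.0, s2 ≈ 0.9, s1 ≈ 0.81
```

This is exactly the behavior the maze discussion asks for: instead of every visited block getting the same value 1, each block's value shrinks by the discount factor the further it is from the diamond, so the agent can tell which direction leads to the goal.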
• 13. CHAPTER 4 TYPES OF REINFORCEMENT LEARNING
There are mainly two types of reinforcement learning:
o Positive Reinforcement
o Negative Reinforcement
Positive Reinforcement: Positive reinforcement means adding something to increase the tendency that the expected behavior will occur again. It impacts the behavior of the agent positively and increases the strength of that behavior. This type of reinforcement can sustain changes for a long time, but too much positive reinforcement may lead to an overload of states, which can diminish the results.
Negative Reinforcement: Negative reinforcement is the opposite of positive reinforcement, as it increases the tendency that a specific behavior will occur again by removing or avoiding a negative condition. It can be more effective than positive reinforcement, depending on the situation and behavior, but it tends to produce only the minimum behavior needed to avoid the negative condition.
• 14. CHAPTER 5 MARKOV DECISION PROCESS
How to represent the agent state?
We can represent the agent state using the Markov state, which contains all the required information from the history. The state St is a Markov state if it satisfies the condition:
P[St+1 | St] = P[St+1 | S1, ..., St]
The Markov state follows the Markov property, which says that the future is independent of the past and can be defined using the present alone. RL here works with fully observable environments, where the agent can observe the environment and act in the new state. The complete process is known as a Markov Decision Process.
Markov Decision Process
A Markov Decision Process, or MDP, is used to formalize reinforcement learning problems. If the environment is completely observable, then its dynamics can be modeled as a Markov process. In an MDP, the agent constantly interacts with the environment and performs actions; after each action, the environment responds and generates a new state.
FIG 6
• 15. An MDP is used to describe the environment for RL, and almost all RL problems can be formalized using an MDP. An MDP is a tuple of four elements (S, A, Pa, Ra) (a small code sketch of such a tuple is given at the end of this slide):
o S: a finite set of states
o A: a finite set of actions
o Ra: the reward received after transitioning from state s to state s' due to action a
o Pa: the probability of transitioning from state s to state s' under action a
An MDP uses the Markov property, and to better understand the MDP, we need to learn about it.
Markov Property
It says that "if the agent is present in the current state s1, performs an action a1 and moves to the state s2, then the transition from s1 to s2 depends only on the current state and the chosen action; it does not depend on past actions, rewards, or states."
Or, in other words, as per the Markov property, the current state transition does not depend on any past action or state. Hence, an MDP is an RL problem that satisfies the Markov property. For example, in a chess game, the players only focus on the current position and do not need to remember past actions or states.
Finite MDP: A finite MDP is one with finite states, finite rewards, and finite actions. In RL, we consider only finite MDPs.
Markov Process:
A Markov process is a memoryless process with a sequence of random states S1, S2, ..., St that satisfies the Markov property. A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state set S and a transition function P. These two components (S and P) can define the dynamics of the system.
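A minimal sketch of the MDP tuple (S, A, Pa, Ra) as plain data. The two states, two actions, transition probabilities, and reward below are hypothetical, chosen only to show the shape of the tuple:

```python
import random

S = ["s1", "s2"]                          # finite set of states
A = ["stay", "move"]                      # finite set of actions
P = {                                      # P[(s, a)] = {s': probability}
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s2": 0.8, "s1": 0.2},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "move"): {"s1": 1.0},
}
R = {("s1", "move", "s2"): 1.0}            # reward for the transition s1 -> s2 via "move"

def step(state, action):
    """Sample the next state from P and look up the reward R(s, a, s')."""
    outcomes = P[(state, action)]
    nxt = random.choices(list(outcomes), weights=outcomes.values())[0]
    return nxt, R.get((state, action, nxt), 0.0)

print(step("s1", "move"))  # ('s2', 1.0) about 80% of the time, ('s1', 0.0) otherwise
```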
• 16. CHAPTER 6 VARIOUS REINFORCEMENT LEARNING ALGORITHMS
Reinforcement learning algorithms are mainly used in AI applications and gaming applications. The most widely used algorithms are:
1. Q-Learning:
a. Q-learning is an off-policy RL algorithm based on temporal-difference (TD) learning. Temporal-difference learning methods work by comparing temporally successive predictions.
b. It learns the action-value function Q(s, a), which tells how good it is to take action "a" in a particular state "s". A minimal code sketch of the tabular update is given below the flowchart.
c. The below flowchart explains the working of Q-learning:
FIG 7
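A minimal tabular Q-learning sketch. The toy chain environment, the ε-greedy action choice, and the parameter values are hypothetical; the point of the example is the off-policy update Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
ACTIONS = ["left", "right"]

def env_step(state, action):
    """Toy chain: states 0..3, reward 1 for reaching state 3 (terminal)."""
    nxt = min(state + 1, 3) if action == "right" else max(state - 1, 0)
    return nxt, (1.0 if nxt == 3 else 0.0), nxt == 3

Q = defaultdict(float)

def choose_action(state):
    if random.random() < EPSILON:                       # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])    # exploit

for episode in range(500):
    s, done = 0, False
    while not done:
        a = choose_action(s)
        s2, r, done = env_step(s, a)
        target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])       # off-policy: uses the max, not the action taken
        s = s2

print({s: max(Q[(s, a)] for a in ACTIONS) for s in range(3)})  # roughly 0.81, 0.9, 1.0
```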
• 17. 2. State-Action-Reward-State-Action (SARSA)
o SARSA stands for State-Action-Reward-State-Action and is an on-policy temporal-difference learning method. An on-policy control method selects the action for each state, while learning, using a specific policy.
o The goal of SARSA is to calculate Qπ(s, a) for the currently selected policy π and all state-action pairs (s, a).
o The main difference between Q-learning and SARSA is that, unlike Q-learning, SARSA does not use the maximum Q-value of the next state to update the Q-table; it uses the Q-value of the action that is actually taken next.
o In SARSA, the new action and its reward are selected using the same policy that determined the original action.
o SARSA is so named because it uses the quintuple (s, a, r, s', a'), where
s: original state
a: original action
r: reward observed after taking a in s
s', a': new state-action pair
A minimal code sketch of the SARSA update is given after the DQN description below.
3. Deep Q-Network (DQN):
• As the name suggests, DQN is Q-learning implemented with a neural network.
• For an environment with a big state space, defining and updating a Q-table becomes a challenging and complex task.
• To solve this issue we can use a DQN algorithm where, instead of defining a Q-table, a neural network approximates the Q-value for each state-action pair.
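A minimal tabular SARSA sketch, for contrast with the Q-learning sketch above. It reuses the same hypothetical toy chain; the only substantive change is that the update target uses the next action a' actually chosen by the policy instead of the max over actions:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
ACTIONS = ["left", "right"]
Q = defaultdict(float)

def env_step(state, action):
    nxt = min(state + 1, 3) if action == "right" else max(state - 1, 0)
    return nxt, (1.0 if nxt == 3 else 0.0), nxt == 3

def choose_action(state):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(500):
    s, done = 0, False
    a = choose_action(s)
    while not done:
        s2, r, done = env_step(s, a)
        a2 = choose_action(s2)                              # on-policy: the action we will actually take
        Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])
        s, a = s2, a2                                       # carry the chosen action into the next step

print(max(Q[(0, a)] for a in ACTIONS))  # close to 0.81 for this toy chain
```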
• 18. Two further techniques are also essential for training a DQN:
1. Experience Replay
2. A Separate Target Network
A minimal sketch of both is given after the DDPG discussion below.
4. Deep Deterministic Policy Gradient (DDPG)
DDPG relies on the actor-critic architecture, with two eponymous elements, an actor and a critic. The actor tunes the parameters θ of the policy function, i.e. it decides the best action for a specific state. The critic evaluates the policy function estimated by the actor according to the temporal-difference (TD) error. The critic's update looks just like the Q-learning update equation: TD learning is a way of learning to predict a value from the future values of a given state, and Q-learning is a specific type of TD learning used to learn the Q-value.
• 19. FIG 8
DDPG also borrows the ideas of experience replay and a separate target network from DQN. Another issue for DDPG is that it seldom explores the action space on its own. A solution for this is adding noise to the parameter space or the action space.
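A minimal sketch of the two training aids mentioned above, experience replay and a separate target network. The capacity, the soft-update rate τ, and the "weights as plain lists" representation are illustrative assumptions, not tied to any particular deep learning library:

```python
import random
from collections import deque

# 1) Experience replay: store transitions and train on random minibatches,
#    which breaks the correlation between consecutive samples.
replay_buffer = deque(maxlen=10_000)

def store(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))

def sample_batch(batch_size=32):
    return random.sample(replay_buffer, min(batch_size, len(replay_buffer)))

# 2) Separate target network: a slowly updated copy of the online network's
#    parameters, used to compute stable TD targets. DDPG updates it "softly":
#    θ_target ← τ·θ_online + (1 − τ)·θ_target
def soft_update(online_params, target_params, tau=0.005):
    return [tau * o + (1 - tau) * t for o, t in zip(online_params, target_params)]

store("s1", "right", 1.0, "s2", False)
print(sample_batch(1))
print(soft_update([1.0, 2.0], [0.0, 0.0]))  # target weights drift slowly toward the online weights
```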
• 20. Classification of RL Algorithms
FIG 9
• 21. CHAPTER 7 COMPARISON OF REINFORCEMENT LEARNING ALGORITHMS
1. Associative reinforcement learning
Associative reinforcement learning tasks combine facets of stochastic learning automata tasks and supervised learning pattern classification tasks. In associative reinforcement learning tasks, the learning system interacts in a closed loop with its environment.
2. Deep reinforcement learning
This approach extends reinforcement learning by using a deep neural network, without explicitly designing the state space. The work on learning Atari games by Google DeepMind increased attention to deep reinforcement learning, or end-to-end reinforcement learning.
3. Inverse reinforcement learning
In inverse reinforcement learning (IRL), no reward function is given. Instead, the reward function is inferred from behavior observed from an expert. The idea is to mimic the observed behavior, which is often optimal or close to optimal.
4. Safe reinforcement learning
Safe reinforcement learning (SRL) can be defined as the process of learning policies that maximize the expected return in problems where it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes.
• 23. CHAPTER 8 LEARNING METHODS
These methods differ from the methods studied previously and are also used comparatively rarely. In this kind of learning algorithm there is an agent that we want to train over a period of time so that it can interact with a specific environment. The agent follows a set of strategies for interacting with the environment, and after observing the environment it takes actions based on the current state of the environment. The following are the main steps of reinforcement learning methods (a minimal sketch of this loop is given below the list):
1. First, we need to prepare an agent with some initial set of strategies.
2. Then observe the environment and its current state.
3. Next, select the optimal policy for the current state of the environment and perform a suitable action.
4. Now, the agent gets a corresponding reward or penalty according to the action taken in the previous step.
5. Now, we can update the strategies if required.
6. At last, repeat steps 2-5 until the agent learns and adopts the optimal policy.
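A minimal sketch of the interaction loop described in steps 1-6. The toy environment, the agent's very simple "strategy" table, and the reward scheme are hypothetical stand-ins, kept tiny so the loop structure is visible:

```python
import random

class ToyEnvironment:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = min(self.state + 1, 3) if action == "forward" else max(self.state - 1, 0)
        reward = 1.0 if self.state == 3 else -0.1        # reward or penalty (step 4)
        return self.state, reward, self.state == 3

class ToyAgent:
    def __init__(self):
        self.preferences = {}                            # initial set of strategies (step 1)

    def select_action(self, state):                      # observe state, pick an action (steps 2-3)
        return self.preferences.get(state, random.choice(["forward", "back"]))

    def update(self, state, action, reward):             # update the strategy (step 5)
        if reward > 0:
            self.preferences[state] = action

env, agent = ToyEnvironment(), ToyAgent()
for episode in range(10):                                # repeat until learned (step 6)
    state, done = env.reset(), False
    while not done:
        action = agent.select_action(state)
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward)
        state = next_state
```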
• 24. Difference between Reinforcement learning and Supervised learning:
Reinforcement learning: Reinforcement learning is all about making decisions sequentially. In simple words, the output depends on the state of the current input, and the next input depends on the output of the previous input. Decisions are dependent, so we give labels to sequences of dependent decisions. Example: a chess game.
Supervised learning: In supervised learning, the decision is made on the initial input, or the input given at the start. Decisions are independent of each other, so a label is given to each decision. Example: object recognition.
TABLE 1
• 26. CHAPTER 9 APPLICATIONS
Reinforcement Learning Applications
FIG 11
1. Robotics: RL is used in robot navigation, robo-soccer, walking, juggling, etc.
2. Control: RL can be used for adaptive control, such as factory processes, admission control in telecommunications, and helicopter piloting.
3. Game Playing: RL can be used in game playing, such as tic-tac-toe, chess, etc.
4. Chemistry: RL can be used for optimizing chemical reactions.
• 27. 5. Business: RL is now used for business strategy planning.
6. Manufacturing: In various automobile manufacturing companies, robots use deep reinforcement learning to pick goods and put them in containers.
7. Finance Sector: RL is currently used in the finance sector for evaluating trading strategies.
• 28. CHAPTER 10 PROS AND CONS
Pros of Reinforcement Learning
• Reinforcement learning can be used to solve very complex problems that cannot be solved by conventional techniques.
• This technique is preferred for achieving long-term results, which are otherwise very difficult to achieve.
• This learning model is very similar to how human beings learn. Hence, it is close to achieving perfection.
• The model can correct the errors that occurred during the training process.
• Once an error is corrected by the model, the chance of the same error occurring again is very low.
• It can create an ideal model to solve a particular problem.
• Robots can implement reinforcement learning algorithms to learn how to walk.
• In the absence of a training dataset, it is bound to learn from its own experience.
• Reinforcement learning models can outperform humans in many tasks. DeepMind's AlphaGo program, a reinforcement learning model, beat the world champion Lee Sedol at the game of Go in March 2016.
• Reinforcement learning is intended to achieve the ideal behavior of a model within a specific context, so as to maximize its performance.
• It can be useful when the only way to collect information about the environment is to interact with it.
• 29. • Reinforcement learning algorithms maintain a balance between exploration and exploitation. Exploration is the process of trying different things to see if they are better than what has been tried before. Exploitation is the process of trying the things that have worked best in the past. Many other learning algorithms do not maintain this balance.
Cons of Reinforcement Learning
• Reinforcement learning as a framework is wrong in many different ways, but it is precisely this quality that makes it useful.
• Too much reinforcement learning can lead to an overload of states, which can diminish the results.
• Reinforcement learning is not preferable for solving simple problems.
• Reinforcement learning needs a lot of data and a lot of computation; it is data-hungry. That is why it works really well in video games, where one can play the game again and again, so getting lots of data is feasible.
• Reinforcement learning assumes the world is Markovian, which it often is not. The Markovian model describes a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
• The curse of dimensionality limits reinforcement learning heavily for real physical systems. According to Wikipedia, the curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces and that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.
• Another disadvantage is the curse of real-world samples. For example, consider the case of learning by robots. Robot hardware is usually
• 30. very expensive, suffers from wear and tear, and requires careful maintenance. Repairing a robot system costs a lot.
• To address many problems of reinforcement learning, we can use a combination of reinforcement learning with other techniques rather than abandoning it altogether. One popular combination is reinforcement learning with deep learning.
• 31. CHAPTER 11 CONCLUSION
We can say that Reinforcement Learning is one of the most interesting and useful parts of Machine Learning. In RL, the agent learns about the environment by exploring it without any human intervention. It is one of the core learning approaches used in Artificial Intelligence. But there are some cases where it should not be used: for example, if you already have enough data to solve the problem, other ML algorithms can be used more efficiently. The main issue with RL algorithms is that some factors, such as delayed feedback, may affect the speed of learning.
REFERENCES:
1. https://www.javatpoint.com/reinforcement-learning
2. https://towardsdatascience.com/introduction-to-various-reinforcement-learning-algorithms-i-q-learning-sarsa-dqn-ddpg-72a5e0cb6287
3. https://en.wikipedia.org/wiki/Reinforcement_learning#Comparison_of_reinforcement_learning_algorithms