2. GROUP MEMBERS
20BAI10254 ARCHIT SRIVASTAVA
20BAI10321 KARTHIK BISHT
20BAI10154 HARSH SEN
20BAI10129 DIVYANSHU CHETAN
3. INTRODUCTION
Reinforcement learning (RL) is an important sub-category of machine learning.
Unlike most machine learning methods, RL does not require the raw data to be
labeled. Instead of being told the correct answer, the algorithm receives a
reward signal indicating whether the decision it made was a good one.
RL is based on interactions between an AI system and its environment. An
algorithm receives a numerical score based on its outcome and then the
positive behaviors are “reinforced” to refine the algorithm over time. In recent
years, RL has been behind super-human performance in Go, Atari games and
many other applications.
4. WHAT IS THE NAÏVE REINFORCE ALGORITHM AND
HOW DOES IT WORK
REINFORCE belongs to the class of Policy Gradient algorithms used in
Reinforcement Learning.
The most straightforward way to implement this approach is to build a
Policy—a model that receives a state as input and outputs the probability of
executing each action.
A policy is simply a manual or cheat sheet that tells the agent what to do in
each state.
The policy is then improved iteratively, with minor changes made at each
stage, until we have a policy that solves the environment.
5. The policy is usually a Neural Network that takes the state as input and
generates a probability distribution across the action space as output, with
the objective of maximizing the “Expected reward”.
The policy determines the likelihood of each action being taken in each state
of the environment.
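Such a policy can be sketched in Python as a tiny neural network (one hidden layer, NumPy only) that maps a state vector to a probability distribution over actions. The layer sizes and state/action dimensions below are illustrative, not taken from any particular environment.

```python
import numpy as np

rng = np.random.default_rng(42)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 2   # illustrative, e.g. CartPole-like sizes

# randomly initialized weights of the policy network
W1 = rng.normal(0, 0.1, (HIDDEN, STATE_DIM))
W2 = rng.normal(0, 0.1, (N_ACTIONS, HIDDEN))

def policy(state):
    """Return pi(a | state): a probability distribution over the action space."""
    h = np.tanh(W1 @ state)             # hidden layer
    logits = W2 @ h
    z = np.exp(logits - logits.max())   # numerically stable softmax
    return z / z.sum()

probs = policy(np.array([0.1, -0.2, 0.05, 0.0]))
# probs is non-negative and sums to 1, so the agent can sample an action from it
```

Because the output is a valid probability distribution, the agent can sample an action directly from it rather than always taking the arg-max.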
6. The agent samples from these probabilities and selects an action to perform in
the environment. At the end of an episode, we know the total rewards the
agent can get if it follows that policy. We backpropagate the reward through
the path the agent took to estimate the “Expected reward” at each state for a
given policy.
The expected reward is given as the sum of the probability of an action in state
s multiplied by the discounted reward.
Here the discounted reward is the sum of all the rewards the agent receives in
the future, discounted by a factor gamma.
As per the original implementation of the REINFORCE algorithm, the
Expected reward is computed as the sum of the products of the log of the
action probabilities and the discounted rewards.
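The discounted reward described above can be computed for each step of an episode by working backwards through the recorded rewards, using the recurrence G_t = r_t + gamma * G_{t+1}. A minimal sketch:

```python
def discounted_returns(rewards, gamma):
    """Return the discounted reward G_t for each step t of an episode."""
    G, out = 0.0, []
    for r in reversed(rewards):   # walk the episode backwards
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

For example, with rewards [1, 1, 1] and gamma = 0.5, the last step is worth 1, the middle step 1 + 0.5·1 = 1.5, and the first step 1 + 0.5·1.5 = 1.75.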
7. Using the policy gradient theorem, we can devise a naive algorithm that uses
gradient ascent to update our policy parameters.
The theorem gives a sum over all states and actions, but when updating the
parameters we use only the sampled gradient, because we cannot compute
the gradient over all possible states and actions.
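One sample-based update can be sketched for a simple softmax policy over logits theta, for which grad log pi(a) = onehot(a) − pi. A single (action, return) sample then moves the parameters by alpha · G · grad log pi(a). The function name and step size here are illustrative.

```python
import numpy as np

def reinforce_update(theta, action, G, alpha=0.01):
    """One REINFORCE gradient-ascent step for a softmax policy over logits theta."""
    z = np.exp(theta - theta.max())   # numerically stable softmax
    pi = z / z.sum()
    grad_log_pi = -pi
    grad_log_pi[action] += 1.0        # onehot(action) - pi
    return theta + alpha * G * grad_log_pi

theta = np.zeros(3)
theta = reinforce_update(theta, action=1, G=2.0)
# the logit of the chosen (rewarded) action increases; the others decrease
```

Note the sign: with a positive return G the sampled action becomes more probable, which is exactly the "reinforcement" of positive behavior described earlier.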
8. STEPS INVOLVED
The steps involved in the implementation of REINFORCE are as follows:
1. Initialize a random policy (a neural network that takes the state as input
and returns the probabilities of actions).
2. Use the policy to play N steps of the game, recording the action
probabilities (from the policy), the rewards (from the environment), and the
actions (sampled by the agent).
3. Calculate the discounted reward for each step by working backwards
through the episode.
4. Calculate the expected reward G.
5. Adjust the weights of the policy (back-propagate the error in the neural
network) to increase G.
6. Repeat from step 2.
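The steps above can be sketched end-to-end on a toy problem. The environment here is a hypothetical 2-armed bandit (action 0 always pays a reward of 1, action 1 pays 0), and the policy is a bare softmax over two logits rather than a full neural network; episodes are a single step, so the discounted return reduces to the immediate reward. All names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)   # policy parameters: one logit per action
alpha = 0.1           # learning rate for gradient ascent

def policy(theta):
    """Softmax over logits: a probability distribution over the two actions."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

for episode in range(500):
    # play one (single-step) episode, sampling an action from the policy
    probs = policy(theta)
    action = rng.choice(2, p=probs)
    reward = 1.0 if action == 0 else 0.0
    # the discounted return of a one-step episode is just the reward
    G = reward
    # gradient ascent: grad log pi(a) = onehot(a) - pi for a softmax policy
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta = theta + alpha * G * grad_log_pi

print(policy(theta))  # the policy should now strongly prefer action 0
```

After training, the probability of the rewarded action dominates: the positively rewarded behavior has been reinforced, exactly as the steps describe.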
10. Naïve REINFORCE Characteristics
Naïve REINFORCE is a policy gradient algorithm. Policy-Gradient methods are
a subclass of Policy-Based methods that estimate an optimal policy’s weights
through gradient ascent.
This algorithm is the fundamental policy gradient algorithm on which nearly
all the advanced policy gradient algorithms are based.
REINFORCE is a family of reinforcement learning methods that directly
update the policy weights.
11. Policy gradient algorithms attempt to determine the best policy by learning
the policy directly, rather than by computing estimates of the action values
as in Q-value approaches.
Unlike Q-Learning, these methods return a probability distribution over the
actions rather than an action vector.
The REINFORCE algorithm finds an unbiased estimate of the gradient, but
without the assistance of a learned value function. As a result, REINFORCE
learns much more slowly than RL methods that use value functions.