2. Reinforcement learning
• Reinforcement learning, in its simplest definition, is learning the best
actions based on reward or punishment.
• There are three basic concepts in reinforcement learning:
state, action, and reward.
3. • In this picture, a lady wants to train
her dog.
• She orders the dog to perform a certain
action, and for every proper execution she
gives it an orange as a reward.
• The dog learns that performing that
action earns it an orange.
4.
5. STATE:
• The state describes the current situation. For a robot that is learning
to walk, the state is the position of its two legs.
ACTION:
• Action is what an agent can do in each state.
• Given the state, or positions of its two legs, a robot can take steps
within a certain distance.
REWARD:
• When a robot takes an action in a state, it receives a reward.
• Here the term “reward” is an abstract concept that describes
feedback from the environment.
6. • When the reward is positive, it corresponds to our usual
meaning of reward.
• When the reward is negative, it corresponds to what we usually
call “punishment.”
7. RL Cont...
• A robot learns to go through a maze.
• When the robot takes one step to the right, it reaches an open
location; if it goes right for three steps, it hits a wall.
• The robot that is running through the maze remembers every wall it
hits.
• In the end, it remembers the previous actions that lead to dead ends.
• It also remembers the path (that is, a sequence of actions) that leads
it successfully through the maze.
8. RL Cont...
• The essential goal of reinforcement learning is to learn a sequence
of actions that leads to the greatest long-term reward.
• An agent learns that sequence by interacting with the environment
and observing the rewards in every state.
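The interaction loop described above can be sketched in a few lines of Python. The two-state "environment" and its reward scheme here are invented purely for illustration:

```python
import random

# A minimal sketch of the agent-environment interaction loop.
# This 2-state chain is hypothetical, invented only for illustration.

def step(state, action):
    """Return (next_state, reward) for a toy 2-state chain."""
    if state == 0 and action == "right":
        return 1, 1.0   # moving right out of state 0 is rewarded
    return 0, 0.0       # any other move resets to state 0 with no reward

state = 0
total_reward = 0.0
for _ in range(10):                             # interact for 10 steps
    action = random.choice(["left", "right"])   # a (random) policy picks an action
    state, reward = step(state, action)         # environment returns state and reward
    total_reward += reward                      # the agent observes the reward
```

A learning agent would replace the random choice with a policy that improves as rewards are observed; the loop structure stays the same.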
9. Q-learning: A commonly used reinforcement
learning method
• Q-learning is the most commonly used reinforcement learning
method, where Q stands for the long-term value of an action.
• Q-learning is about learning Q-values through observations.
• The procedure for Q-learning is:
• Q(state, action) = (1 − learning_rate) × Q(state, action) +
learning_rate × (r + discount_rate × max_a Q(state’, a))
• In the beginning, the agent initializes the Q-values to 0 for every state-
action pair; that is, Q(s, a) = 0 for all states s and
actions a.
• After the agent starts learning, it takes an action a in state s and
receives reward r.
10. RL Cont..
• It also observes that the state has changed to a
new state s’. The agent then updates Q(state, action) with the
formula above.
• The learning rate is a number between 0 and 1. It is the weight given
to new information versus old information.
• The new long-term reward is the current reward, r, plus all
future rewards in the next state, s’, and later states, assuming this
agent always takes its best actions in the future.
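The update rule above can be transcribed directly into a small Python function. The dictionary-based Q-table and the parameter defaults here are illustrative choices, not prescribed by the slides:

```python
# One Q-learning update, following the formula on the previous slide.
# Q is a dict mapping (state, action) -> value; unseen pairs default to 0.

def q_update(Q, state, action, reward, next_state, next_actions,
             learning_rate=0.2, discount_rate=0.9):
    """Apply one Q-learning update and return the new Q(state, action)."""
    # max_a Q(state', a): value of the best action available in the new state
    best_next = max(Q.get((next_state, a), 0.0) for a in next_actions)
    Q[(state, action)] = ((1 - learning_rate) * Q.get((state, action), 0.0)
                          + learning_rate * (reward + discount_rate * best_next))
    return Q[(state, action)]
```

With learning_rate = 0.2, each update blends 80% of the old estimate with 20% of the newly observed return.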
11. RL Cont..
• The future rewards are discounted by a discount rate between 0
and 1, meaning future rewards are not as valuable as the reward now.
• As the agent visits all the states and tries different actions,
it eventually learns the optimal Q-values for all possible state-
action pairs. Then it can derive the action in every state that is
optimal for the long term.
13. RL Cont..
• The robot starts from the lower left corner of the maze.
• Each location (state) is indicated by a number.
• There are four action choices (left, right, up, down), but in certain states,
action choices are limited.
• For example, in state 1 (the initial state), the robot has only two
action choices: up or right.
• In state 4, it has three action choices: left, right, or up.
• When the robot hits a wall, it receives reward -1.
• When it reaches an open location, it receives reward 0.
• When it reaches the exit, it receives reward 100.
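The reward rules above can be sketched as a step function. The transition table below is hypothetical (the slides show the maze only as a picture), so this fragment covers just the few states the text mentions:

```python
# Reward convention from the slides: -1 for a wall, 0 for an open
# location, 100 for the exit. The transitions are assumed, not taken
# from the actual maze layout.

TRANSITIONS = {            # (state, action) -> next state
    (1, "up"): 2,          # assumed: state 2 lies above state 1
    (1, "right"): 4,
    (4, "left"): 1,
    (4, "right"): 5,
    (5, "up"): "exit",
}

def maze_step(state, action):
    """Return (next_state, reward) under the slide's reward rules."""
    if (state, action) not in TRANSITIONS:
        return state, -1                   # hitting a wall: stay put, reward -1
    next_state = TRANSITIONS[(state, action)]
    return next_state, 100 if next_state == "exit" else 0
```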
14. RL Cont..
• Q(state, action) = (1 − learning_rate) × Q(state, action)
+ learning_rate × (r + discount_rate × max_a Q(state’, a))
• Here the learning rate is 0.2 and the discount rate is 0.9.
• Q(4, left) = 0.8 × 0 + 0.2 × (0 + 0.9 × Q(1, right))
• Q(4, right) = 0.8 × 0 + 0.2 × (0 + 0.9 × Q(5, up))
• Because state 5 is closer to the exit, Q(5, up) has a higher value than Q(1, right).
• For this reason, Q(4, right) has a higher value than Q(4, left).
• Thus, the best action in state 4 is going right.
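The comparison can be checked numerically. The current estimates for Q(1, right) and Q(5, up) below are assumed values, chosen only to reflect that state 5, being closer to the exit, carries the larger Q-value:

```python
# Plugging assumed Q-value estimates into the two updates from the slide.
learning_rate, discount_rate = 0.2, 0.9
Q_1_right = 10.0    # assumed current estimate
Q_5_up = 80.0       # assumed current estimate, larger (nearer the exit)

Q_4_left = (1 - learning_rate) * 0 + learning_rate * (0 + discount_rate * Q_1_right)
Q_4_right = (1 - learning_rate) * 0 + learning_rate * (0 + discount_rate * Q_5_up)
# Whatever the exact estimates, Q_4_right exceeds Q_4_left as long as
# Q_5_up > Q_1_right, so the best action in state 4 is going right.
```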
15.
16. Advantages of Reinforcement Learning
• It can solve high-order, complex problems, and the solutions
obtained can be very accurate.
• The reason is that it closely resembles the way humans
learn.
• Due to its learning ability, it can be combined with neural networks. This
combination is termed deep reinforcement learning.
• Best of all, even when no training data is available, the agent
learns from the experience it gains by interacting with the environment.