Reinforcement Learning
Reinforcement learning
• Reinforcement learning, in simple terms, is learning the best
actions to take based on reward or punishment.
• There are three basic concepts in reinforcement learning:
– State
– Action
– Reward
• In this picture, a lady wants to train
her dog.
• She orders the dog to perform a certain
action, and for every proper execution she
gives the dog an orange as a reward.
• The dog remembers that if it performs that
action, it gets an orange.
STATE:
• The state describes the current situation. For a robot that is learning
to walk, the state is the position of its two legs.
ACTION:
• Action is what an agent can do in each state.
• Given the state, or positions of its two legs, a robot can take steps
within a certain distance.
REWARD:
• When a robot takes an action in a state, it receives a reward.
• Here the term “reward” is an abstract concept that describes
feedback from the environment.
• When the reward is positive, it corresponds to our normal
meaning of reward.
• When the reward is negative, it corresponds to what we usually
call “punishment.”
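• As a small illustration (not from the original slides), one interaction step can be recorded as a (state, action, reward, next state) tuple; the field values below for the walking robot are purely hypothetical:

from collections import namedtuple

# One interaction step: the agent sees a state, takes an action,
# and receives a reward from the environment (plus the next state).
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

# Hypothetical example for the walking robot: leg positions as the state.
step = Transition(state=("left_leg_up", "right_leg_down"),
                  action="swing_right_leg",
                  reward=0.0,
                  next_state=("left_leg_down", "right_leg_up"))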
RL Cont...
• A robot learns to go through a maze.
• When the robot takes one step to the right, it reaches an open
location; if it keeps going right for three steps, it hits a wall.
• As it runs through the maze, the robot remembers every wall it
hits.
• In the end, it remembers the actions that led to dead ends.
• It also remembers the path (that is, a sequence of actions) that leads
it successfully through the maze.
RL Cont...
• The essential goal of reinforcement learning is to learn a sequence
of actions that leads to a long-term reward.
• An agent learns that sequence by interacting with the environment
and observing the rewards in every state.
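• As a rough sketch, that interaction can be written as a loop; the env.reset()/env.step() interface below is an assumption in the style of common RL toolkits, not something specified in these slides:

# Generic agent-environment loop: the agent observes a state, picks an
# action, and the environment returns a reward and the next state.
def run_episode(env, choose_action, max_steps=100):
    state = env.reset()                          # start of an episode
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(state)            # agent's decision in this state
        state, reward, done = env.step(action)   # environment feedback
        total_reward += reward                   # accumulate long-term reward
        if done:                                 # e.g. the robot reached the exit
            break
    return total_reward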
Q-learning: A commonly used reinforcement
learning method
• Q-learning is the most commonly used reinforcement learning
method, where Q stands for the long-term value of an action.
• Q-learning is about learning Q-values through observations.
• The Q-learning update rule is:
• Q(state, action) = (1 − learning_rate) × Q(state, action) +
learning_rate × (r + discount_rate × max_a’ Q(state’, a’))
• In the beginning, the agent initializes Q-values to 0 for every
state-action pair. More precisely, Q(s, a) = 0 for all states s and
actions a.
• After the agent starts learning, it takes an action a in state s and
receives reward r.
RL Cont..
• It also observes that the state has changed to a
new state s’. The agent then updates Q(state, action) with the above
formula.
• The learning rate is a number between 0 and 1. It is a weight given
to the new information versus the old information.
• The new long-term reward is the current reward, r, plus all
future rewards in the next state, s’, and later states, assuming this
agent always takes its best actions in the future.
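• A minimal sketch of this update as code (the dictionary-based Q-table and the function name q_update are illustrative choices, not something defined in these slides):

def q_update(Q, state, action, reward, next_state, next_actions,
             learning_rate=0.2, discount_rate=0.9):
    """Apply one Q-learning update: blend the old Q-value with the
    new estimate r + discount_rate * max_a' Q(next_state, a')."""
    # next_actions is the (assumed non-empty) set of actions available in next_state.
    best_next = max(Q.get((next_state, a), 0.0) for a in next_actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = (1 - learning_rate) * old \
        + learning_rate * (reward + discount_rate * best_next)
    return Q[(state, action)]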
RL Cont..
• The future rewards are discounted by a discount rate between 0
and 1, meaning future rewards are not as valuable as the reward now.
• As the agent visits all the states and tries different actions,
it eventually learns the optimal Q-values for all possible
state-action pairs. Then it can derive the action in every state that is
optimal for the long term.
Maze robot example:
RL Cont..
• The robot starts from the lower left corner of the maze.
• Each location (state) is indicated by a number.
• There are four action choices (left, right, up, down), but in certain states,
action choices are limited.
• For example, in state 1 (the initial state), the robot has only two
action choices: up or right.
• In state 4, it has three action choices: left, right, or up.
• When the robot hits a wall, it receives reward -1.
• When it reaches an open location, it receives reward 0.
• When it reaches the exit, it receives reward 100.
RL Cont..
• Q(state, action) = (1 − learning_rate) × Q(state, action)
+ learning_rate × (r + discount_rate × max_a’ Q(state’, a’))
• With learning rate 0.2 and discount rate 0.9:
• Q(4, left) = 0.8 × 0 + 0.2 × (0 + 0.9 × Q(1, right))
• Q(4, right) = 0.8 × 0 + 0.2 × (0 + 0.9 × Q(5, up))
• Since state 5 lies on the path toward the exit, Q(5, up) has a higher value than Q(1, right).
• For this reason, Q(4, right) has a higher value than Q(4, left).
• Thus, the best action in state 4 is going right.
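• As a rough end-to-end sketch of this example, the snippet below runs tabular Q-learning with the reward scheme from these slides (-1 for a wall, 0 for an open location, 100 for the exit) and the same learning rate (0.2) and discount rate (0.9); the specific layout, transitions, and state numbering are invented stand-ins, not the actual maze from the figure.

import random

# Hypothetical maze: transitions[state][action] = (next_state, reward).
# A wall hit leaves the robot where it is and costs -1; the exit pays 100.
EXIT = 6
transitions = {
    1: {"up": (2, 0), "right": (4, 0)},
    2: {"down": (1, 0), "right": (2, -1)},
    4: {"left": (1, 0), "right": (5, 0), "up": (4, -1)},
    5: {"left": (4, 0), "up": (EXIT, 100)},
}

alpha, gamma = 0.2, 0.9      # learning rate and discount rate from the slide
Q = {(s, a): 0.0 for s, acts in transitions.items() for a in acts}

for episode in range(500):
    state = 1                                           # robot starts in state 1
    while state != EXIT:
        action = random.choice(list(transitions[state]))    # explore randomly
        next_state, reward = transitions[state][action]
        # best achievable value from the next state (0 once at the exit)
        best_next = max((Q[(next_state, a)] for a in transitions.get(next_state, {})),
                        default=0.0)
        Q[(state, action)] = (1 - alpha) * Q[(state, action)] \
            + alpha * (reward + gamma * best_next)
        state = next_state

# After training, the greedy choice in state 4 should be "right",
# matching the conclusion above.
print(max(transitions[4], key=lambda a: Q[(4, a)]))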
Advantages of Reinforcement Learning
• It can solve complex, higher-order problems, and the solutions
obtained can be very accurate.
• One reason for this is that it closely resembles the way humans
learn.
• Because of its learning ability, it can be combined with neural networks; this
is termed deep reinforcement learning.
• The best part is that even when there is no training data, it
learns from the experience it gains by interacting with the environment.
Disadvantages of Reinforcement Learning
• It consumes a lot of time and computational power.
THANK YOU
