Unit 5: Reinforcement Learning
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2021)
Outline
 Why Reinforcement Learning?
 Reinforcement Learning Process
 Reinforcement Learning Definitions
 Reinforcement Learning Concepts
 Understanding Q-Learning
Reinforcement Learning
 Machine Learning is a subset of AI that gives machines the ability to learn automatically and improve from experience without being explicitly programmed.
 Types of Machine Learning:
 Supervised Learning (label provided for every input)
 Unsupervised Learning (no label provided)
 Reinforcement Learning (classical AI and ML)
 Reinforcement learning is a type of machine learning in which an agent learns to behave in an environment by performing actions and observing the results.
Reinforcement Learning
 Examples of RL:
 A robot cleaning a room,
 Game playing,
 Learning how to fly a helicopter,
 Scheduling planes to their destinations,
 And so on…
 Reinforcement learning with an analogy:
 Scenario 1: The baby starts crawling and reaches the candy. (reward: candy)
 Scenario 2: The baby starts crawling but is blocked by a hurdle on the way. (no candy)
Reinforcement Learning Process
 A reinforcement learning system is composed of two main components:
 Agent
 Environment
 The agent's action influences the state of the world, which in turn determines the reward.
Reinforcement Learning Process
 However, the agent does not know anything about the environment model, i.e., the transition function T(s, a, s').
 This leads to two approaches:
 Model-based RL:
 Learn the model, and use it to derive the optimal policy.
 E.g., adaptive dynamic programming (ADP).
 Model-free RL:
 Derive the optimal policy without learning the model.
 E.g., LMS and temporal-difference (TD) approaches (a textbook TD update is sketched below).
 Which one is better?
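As a point of reference (not stated on these slides), the textbook temporal-difference (TD(0)) update for the value of a state, with learning rate α and discount γ, has the form:

\[
V(s) \leftarrow V(s) + \alpha \big[ r + \gamma V(s') - V(s) \big]
\]

where r is the reward received on moving from state s to state s'. The update needs no transition model T(s, a, s'), which is what makes the approach model-free.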
Reinforcement Learning Process
 A reinforcement learning system faces several complications:
 The outcome of your actions may be uncertain,
 You may not be able to perfectly sense the state of the world,
 The reward may be stochastic,
 The reward may be delayed (e.g., finding food in a maze),
 You may have no clue (model) about how the world responds to your actions,
 The world may change while you try to learn it,
 How much time do you need to explore uncharted territory before you exploit what you have learned?
Reinforcement Learning Definitions
 Agent: The RL algorithm that learns from trial and error.
 Environment: The world through which the agent moves.
 Action (A): All the possible steps that the agent can take.
 State (S): Current condition returned by the environment.
 Reward (R): An instant return from the environment to appraise the last action.
Reinforcement Learning Definitions
 Policy (π): The approach that the agent uses to determine the next action based on the current state.
 Value (V): The expected long-term return with discount, as opposed to the short-term reward R.
 Action-value (Q): Similar to Value, except that it takes an extra parameter, the current action (A).
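To make these terms concrete, here is a minimal Python sketch of the agent-environment interaction loop. The Environment and Agent classes and their toy dynamics are hypothetical placeholders for illustration; they are not from these notes or from any particular library.

```python
# Minimal agent-environment loop (hypothetical toy classes for illustration).
class Environment:
    def reset(self):
        return 0                                  # initial state S

    def step(self, action):
        next_state = action                       # toy dynamics: the action picks the next state
        reward = 100 if next_state == 3 else 0    # reward R appraising the last action
        done = (next_state == 3)                  # the episode ends at the goal state
        return next_state, reward, done

class Agent:
    def act(self, state):
        return (state + 1) % 4                    # trivial policy: always move to the next state

env, agent = Environment(), Agent()
state, done = env.reset(), False
while not done:
    action = agent.act(state)                     # the agent picks an action A in state S
    state, reward, done = env.step(action)        # the environment returns a new state S and reward R
```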
Reinforcement Learning Concepts
 The reward maximization principle states that an RL agent must be trained so that it takes the best action, i.e., the action that maximizes the reward.
 Exploitation: using the information the agent has already gathered to maximize the reward.
 Exploration: gathering more information about the environment. (A common way to balance the two, the epsilon-greedy rule, is sketched just after this list.)
 Q-learning is one such method: it simply learns from experience.
 No model of the world is needed.
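As a sketch of how exploration and exploitation are commonly balanced, here is the epsilon-greedy rule referred to above; the Q dictionary and action list are hypothetical illustrations, not part of the lecture's example.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick an action: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                  # explore: gather new information
    return max(actions, key=lambda a: Q[(state, a)])   # exploit: use what is already known

# Example usage with a hypothetical Q-table stored as a dictionary.
Q = {(0, "left"): 1.0, (0, "right"): 2.5}
print(epsilon_greedy(Q, state=0, actions=["left", "right"]))
```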
Markov Decision Process (MDP)
 The mathematical framework for mapping a solution in reinforcement learning is called a Markov Decision Process.
 The following parameters are used to apply a solution using MDP:
 Set of actions, A
 Set of states, S
 Reward, R
 Policy, π
 Value, V
Markov Decision Process (MDP)
 Example: the shortest path problem.
 Goal: Find the shortest path between node A and D with minimum possible cost.
 The set of states is denoted by the nodes, i.e., {A, B, C, D}.
 An action is to traverse from one node to another, e.g., {A->B, C->D}.
 The reward is the cost represented by each edge.
 The policy is the path taken to reach the destination, e.g., {A->C->D}.
[Figure: a graph with nodes A, B, C, D and edge costs 15, -10, -20, 30, 10, 50.]
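As a rough sketch of this example in code: the figure does not say which cost belongs to which edge, so the edge costs below are placeholders chosen only so that A->C->D comes out cheapest, matching the policy stated above.

```python
# Hypothetical edge costs for the A-to-D example (placeholder values, not from the figure).
cost = {("A", "B"): 15, ("B", "D"): 30, ("A", "C"): -10, ("C", "D"): 10}

# Candidate policies, i.e., paths from A to D.
paths = [["A", "B", "D"], ["A", "C", "D"]]

def path_cost(path):
    return sum(cost[(path[i], path[i + 1])] for i in range(len(path) - 1))

best = min(paths, key=path_cost)        # the policy with the minimum possible cost
print(best, path_cost(best))            # ['A', 'C', 'D'] 0
```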
Q-Learning
 Q-learning is a value-based reinforcement learning algorithm used to find the
optimal action-selection policy using a Q function.
 Example: Suppose a building with 5 rooms numbered 0–4.
 An agent is placed in any one of the rooms and the goal is to reach outside the
building (room 5).
 Doors 1 and 4 lead into the building from room 5. A reward value is associated
with each door.
Q-Learning with Example
 Example: an agent is placed in any one of the rooms (0, 1, 2, 3, 4), and the goal is to reach the outside of the building (room 5).
 5 rooms in a building are connected by doors.
 Each room is numbered 0 through 4.
 The outside of the building can be thought of as one big room (5).
 Doors 1 and 4 lead into the building from room 5 (outside).
Q-Learning with Example
 Let’s represent the rooms as a graph, each room as a node, and each door as a link.
Q-Learning with Example
 The next step is to associate a reward value with each door.
 Doors that lead directly to the goal have a reward of 100.
 Doors not directly connected to the target room have zero reward.
 Because doors are two-way, two arrows are assigned to each room.
 Each arrow carries an instant reward value.
Q-Learning with Example
 The terminology in Q-Learning includes the terms state and action.
 Each room (including room 5) represents a state.
 The agent's movement from one room to another represents an action.
 A state is depicted as a node, while an action is represented by an arrow.
 Example: the agent traverses from room 2 to room 5.
 Initial state = state 2,
 State 2 -> state 3,
 From state 3 the agent can move to state 2, 1, or 4; it moves to state 4,
 State 4 -> state 5.
Q-Learning with Example
 We can put the state diagram and the instant reward values into a reward table, matrix R, reconstructed below (-1 = no connection, 0 = connection, 100 = connection to the goal).
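The reward matrix itself is not reproduced in these notes, so the table below reconstructs it from the room layout described above. One caveat: the connections of room 0 (a single door to room 4) are assumed from the standard version of this example; they are not stated on the slides.

            Action
  State    0    1    2    3    4    5
    0     -1   -1   -1   -1    0   -1
    1     -1   -1   -1    0   -1  100
    2     -1   -1   -1    0   -1   -1
    3     -1    0    0   -1    0   -1
    4      0   -1   -1    0   -1  100
    5     -1    0   -1   -1    0  100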
Q-Learning with Example
 Another matrix, called the Q matrix, is constructed; it represents the memory of what the agent has learned so far from its experiences.
 A Q-table is simply a lookup table in which we calculate the maximum expected future reward for each action at each state.
 The rows of the Q matrix represent the current state of the agent, while the columns represent the possible actions leading to the next state.
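As a minimal sketch (assuming numpy and the six states and actions of this example), the Q-table can be held as a 6x6 array, initialized to zero as described on the next slide:

```python
import numpy as np

# Q-table: rows = current state (rooms 0-5), columns = action (the room moved to).
# It starts as a zero matrix because the agent has no experience yet.
Q = np.zeros((6, 6))
```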
Q-Learning with Example
 The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a).
 The formula for calculating the Q matrix is as follows:
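The equation itself does not survive in these notes; the standard form used for this room example, consistent with the Gamma-weighted calculations on the following slides, is:

\[
Q(\text{state}, \text{action}) = R(\text{state}, \text{action}) + \gamma \cdot \max_{\text{all actions}} Q(\text{next state}, \text{action})
\]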
 First, the Q matrix is initialized as a zero matrix.
 Taking the room problem described above, let the initial state be room 1.
 Q(1, 5) is calculated by taking a Gamma (γ) value of 0.8.
Q-Learning with Example
 From room 1 we can go either to room 3 or to room 5; let's take room 5.
 From room 5, calculate the maximum Q value for this next state based on all possible actions:
[Figure: state diagram highlighting the move from state 1 to state 5 and the possible actions from state 5 (to states 1, 4, and 5).]
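Working this out with the reconstructed R matrix above (this reproduces the standard calculation for this example; it is not shown on the slide): since Q is still all zeros,

\[
Q(1, 5) = R(1, 5) + 0.8 \cdot \max\{Q(5, 1), Q(5, 4), Q(5, 5)\} = 100 + 0.8 \cdot 0 = 100
\]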
Q-Learning with Example
 Another example: let us start from the initial state, state 3.
 From room 3 we can go to room 1, 2, or 4; let's take room 1.
 From room 1, calculate the maximum Q value:
[Figure: state diagram showing the moves available from state 3 (to states 1, 2, and 4) and, after moving to state 1, the possible actions from state 1 (to states 3 and 5).]
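Working this out in the same way, and assuming the value Q(1, 5) = 100 learned in the previous episode has been kept:

\[
Q(3, 1) = R(3, 1) + 0.8 \cdot \max\{Q(1, 3), Q(1, 5)\} = 0 + 0.8 \cdot 100 = 80
\]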
Q-Learning with Example
 The formula for calculating the Q matrix, given earlier, depends on the Gamma parameter:
 The Gamma parameter has a range of 0 to 1 (0 <= Gamma < 1).
 If Gamma is closer to zero, the agent will tend to consider only immediate rewards (exploitation).
 If Gamma is closer to one, the agent will give future rewards greater weight (exploration).
Q-Learning with Example
 The Q-learning algorithm (a Python sketch of these steps follows the list):
 Set the Gamma parameter and the environment rewards in the reward matrix (R),
 Initialize matrix Q to zero,
 Select a random initial state and set initial state = current state,
 Select one among all possible actions for the current state,
 Using this possible action, consider going to the next state,
 Get the maximum Q value for this next state based on all possible actions,
 Compute Q(state, action) using the formula above,
 Repeat the above steps until current state = goal state.
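Below is a minimal Python sketch of these steps for the room example. It assumes numpy, an arbitrary number of training episodes, and the reward matrix reconstructed earlier (with the room-0 row taken from the standard version of this example); it is an illustration of the algorithm above, not the lecture's own code.

```python
import random
import numpy as np

GAMMA = 0.8      # discount factor used in the example
GOAL = 5         # room 5 (outside) is the goal state

# Step 1: reward matrix R (-1 = no door, 0 = door, 100 = door leading to the goal).
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])

# Step 2: initialize the Q matrix to zero.
Q = np.zeros_like(R, dtype=float)

for episode in range(1000):
    state = random.randrange(6)                      # Step 3: random initial state
    while state != GOAL:                             # Step 8: repeat until the goal is reached
        actions = np.where(R[state] >= 0)[0]         # Step 4: possible actions (existing doors)
        action = int(random.choice(actions))         # pick one of them at random
        next_state = action                          # Step 5: the action moves the agent to that room
        max_next = Q[next_state].max()               # Step 6: max Q of the next state over its actions
        Q[state, action] = R[state, action] + GAMMA * max_next   # Step 7: apply the formula
        state = next_state

print((Q / Q.max() * 100).round())                   # normalized Q matrix after training
```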
Question & Answer
Thank You !!!