Unit 5: Reinforcement Learning
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2021)
Outline
 Why Reinforcement Learning?
 Reinforcement Learning Process
 Reinforcement Learning Definitions
 Reinforcement Learning Concepts
 Understanding Q-Learning
Reinforcement Learning
 Machine Learning is a subset of AI that gives machines the ability to learn automatically and improve from experience without being explicitly programmed.
 Types of Machine Learning:
 Supervised Learning (label provided for every input)
 Unsupervised Learning (no label provided)
 Reinforcement Learning (classical AI and ML)
 Reinforcement learning is a type of machine learning in which an agent learns to behave in an environment by performing actions and observing the results.
Reinforcement Learning
 Examples of RL:
 A robot cleaning a room,
 Game playing,
 Learning how to fly a helicopter,
 Scheduling planes to their destinations,
 And so on…
 Reinforcement learning with an analogy:
 Scenario 1: The baby starts crawling and reaches the candy. (reward: candy)
 Scenario 2: The baby starts crawling but is blocked by a hurdle on the way. (no candy)
Reinforcement Learning Process
 A reinforcement learning system is composed of two main components:
 Agent
 Environment
 The agent's action influences the state of the world, which in turn determines the reward.
Reinforcement Learning Process
 However, the agent does not know anything about the environment model, i.e., the transition function T(s, a, s').
 This leads to two approaches:
 Model-based RL:
 Learn the model, and use it to derive the optimal policy.
 E.g., adaptive dynamic programming (ADP).
 Model-free RL:
 Derive the optimal policy without learning the model.
 E.g., LMS and temporal-difference (TD) approaches (a textbook TD update is sketched below).
 Which one is better?
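As a point of reference (not stated on these slides), the textbook temporal-difference (TD(0)) update for the value of a state, with learning rate α and discount γ, has the form:

\[
V(s) \leftarrow V(s) + \alpha \big[ r + \gamma V(s') - V(s) \big]
\]

where r is the reward received on moving from state s to state s'. The update needs no transition model T(s, a, s'), which is what makes the approach model-free.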
Reinforcement Learning Process
 A reinforcement learning system faces several complications:
 The outcome of your actions may be uncertain,
 You may not be able to perfectly sense the state of the world,
 The reward may be stochastic,
 The reward may be delayed (e.g., finding food in a maze),
 You may have no clue (model) about how the world responds to your actions,
 The world may change while you try to learn it,
 How much time do you need to explore uncharted territory before you exploit what you have learned?
Reinforcement Learning Definitions
 Agent: The RL algorithm that learns from trial and error.
 Environment: The world through which the agent moves.
 Action (A): All the possible steps that the agent can take.
 State (S): Current condition returned by the environment.
 Reward (R): An instant return from the environment to appraise the last action.
Reinforcement Learning Definitions
 Policy (π): The approach that the agent uses to determine the next action based on the current state.
 Value (V): The expected long-term return with discount, as opposed to the short-term reward R.
 Action-value (Q): Similar to Value, except that it takes an extra parameter, the current action (A).
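To make these terms concrete, here is a minimal Python sketch of the agent-environment interaction loop. The Environment and Agent classes and their toy dynamics are hypothetical placeholders for illustration; they are not from these notes or from any particular library.

```python
# Minimal agent-environment loop (hypothetical toy classes for illustration).
class Environment:
    def reset(self):
        return 0                                  # initial state S

    def step(self, action):
        next_state = action                       # toy dynamics: the action picks the next state
        reward = 100 if next_state == 3 else 0    # reward R appraising the last action
        done = (next_state == 3)                  # the episode ends at the goal state
        return next_state, reward, done

class Agent:
    def act(self, state):
        return (state + 1) % 4                    # trivial policy: always move to the next state

env, agent = Environment(), Agent()
state, done = env.reset(), False
while not done:
    action = agent.act(state)                     # the agent picks an action A in state S
    state, reward, done = env.step(action)        # the environment returns a new state S and reward R
```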
Reinforcement Learning Concepts
 The reward maximization principle states that an RL agent must be trained so that it takes the best action, i.e., the action that maximizes the reward.
 Exploitation: using the information the agent has already gathered to maximize the reward.
 Exploration: gathering more information about the environment. (A common way to balance the two, the epsilon-greedy rule, is sketched just after this list.)
 Q-learning is one such method: it simply learns from experience.
 No model of the world is needed.
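As a sketch of how exploration and exploitation are commonly balanced, here is the epsilon-greedy rule referred to above; the Q dictionary and action list are hypothetical illustrations, not part of the lecture's example.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick an action: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                  # explore: gather new information
    return max(actions, key=lambda a: Q[(state, a)])   # exploit: use what is already known

# Example usage with a hypothetical Q-table stored as a dictionary.
Q = {(0, "left"): 1.0, (0, "right"): 2.5}
print(epsilon_greedy(Q, state=0, actions=["left", "right"]))
```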
Markov Decision Process (MDP)
 The mathematical framework for mapping a solution in reinforcement learning is called a Markov Decision Process.
 The following parameters are used to apply a solution using MDP:
 Set of actions, A
 Set of states, S
 Reward, R
 Policy, π
 Value, V
Markov Decision Process (MDP)
 Example: the shortest path problem.
 Goal: Find the shortest path between node A and D with minimum possible cost.
 The set of states is denoted by the nodes, i.e., {A, B, C, D}.
 An action is to traverse from one node to another, e.g., {A->B, C->D}.
 The reward is the cost represented by each edge.
 The policy is the path taken to reach the destination, e.g., {A->C->D}.
[Figure: a graph with nodes A, B, C, D and edge costs 15, -10, -20, 30, 10, 50.]
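As a rough sketch of this example in code: the figure does not say which cost belongs to which edge, so the edge costs below are placeholders chosen only so that A->C->D comes out cheapest, matching the policy stated above.

```python
# Hypothetical edge costs for the A-to-D example (placeholder values, not from the figure).
cost = {("A", "B"): 15, ("B", "D"): 30, ("A", "C"): -10, ("C", "D"): 10}

# Candidate policies, i.e., paths from A to D.
paths = [["A", "B", "D"], ["A", "C", "D"]]

def path_cost(path):
    return sum(cost[(path[i], path[i + 1])] for i in range(len(path) - 1))

best = min(paths, key=path_cost)        # the policy with the minimum possible cost
print(best, path_cost(best))            # ['A', 'C', 'D'] 0
```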
Q-Learning
 Q-learning is a value-based reinforcement learning algorithm used to find the
optimal action-selection policy using a Q function.
 Example: Suppose a building with 5 rooms numbered 0–4.
 An agent is placed in any one of the rooms and the goal is to reach outside the
building (room 5).
 Doors 1 and 4 lead into the building from room 5. A reward value is associated
with each door.
Q-Learning with Example
 Example: an agent is placed in any one of the rooms (0, 1, 2, 3, 4), and the goal is to reach the outside of the building (room 5).
 5 rooms in a building are connected by doors.
 Each room is numbered 0 through 4.
 The outside of the building can be thought of as one big room (5).
 Doors 1 and 4 lead into the building from room 5 (outside).
Q-Learning with Example
 Let’s represent the rooms as a graph, each room as a node, and each door as a link.
Q-Learning with Example
 The next step is to associate a reward value with each door.
 Doors that lead directly to the goal have a reward of 100.
 Doors not directly connected to the target room have zero reward.
 Because doors are two-way, two arrows are assigned to each room.
 Each arrow carries an instant reward value.
Q-Learning with Example
 The terminology in Q-Learning includes the terms state and action.
 Each room (including room 5) represents a state.
 The agent's movement from one room to another represents an action.
 A state is depicted as a node, while an action is represented by an arrow.
 Example: the agent traverses from room 2 to room 5.
 Initial state = state 2,
 State 2 -> state 3,
 From state 3 the agent can move to state 2, 1, or 4; it moves to state 4,
 State 4 -> state 5.
Q-Learning with Example
 We can put the state diagram and the instant reward values into a reward table, matrix R, reconstructed below (-1 = no connection, 0 = connection, 100 = connection to the goal).
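The reward matrix itself is not reproduced in these notes, so the table below reconstructs it from the room layout described above. One caveat: the connections of room 0 (a single door to room 4) are assumed from the standard version of this example; they are not stated on the slides.

            Action
  State    0    1    2    3    4    5
    0     -1   -1   -1   -1    0   -1
    1     -1   -1   -1    0   -1  100
    2     -1   -1   -1    0   -1   -1
    3     -1    0    0   -1    0   -1
    4      0   -1   -1    0   -1  100
    5     -1    0   -1   -1    0  100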
Q-Learning with Example
 Another matrix, called the Q matrix, is constructed; it represents the memory of what the agent has learned so far from its experiences.
 A Q-table is simply a lookup table in which we calculate the maximum expected future reward for each action at each state.
 The rows of the Q matrix represent the current state of the agent, while the columns represent the possible actions leading to the next state.
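As a minimal sketch (assuming numpy and the six states and actions of this example), the Q-table can be held as a 6x6 array, initialized to zero as described on the next slide:

```python
import numpy as np

# Q-table: rows = current state (rooms 0-5), columns = action (the room moved to).
# It starts as a zero matrix because the agent has no experience yet.
Q = np.zeros((6, 6))
```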
Q-Learning with Example
 The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a).
 The formula for calculating the Q matrix is as follows:
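The equation itself does not survive in these notes; the standard form used for this room example, consistent with the Gamma-weighted calculations on the following slides, is:

\[
Q(\text{state}, \text{action}) = R(\text{state}, \text{action}) + \gamma \cdot \max_{\text{all actions}} Q(\text{next state}, \text{action})
\]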
 First, the Q matrix is initialized as a zero matrix.
 Taking the room problem described above, let the initial state be room 1.
 Q(1, 5) is calculated by taking a Gamma (γ) value of 0.8.
Q-Learning with Example
 From room 1 we can go either to room 3 or to room 5; let's take room 5.
 From room 5, calculate the maximum Q value for this next state based on all possible actions:
[Figure: state diagram highlighting the move from state 1 to state 5 and the possible actions from state 5 (to states 1, 4, and 5).]
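Working this out with the reconstructed R matrix above (this reproduces the standard calculation for this example; it is not shown on the slide): since Q is still all zeros,

\[
Q(1, 5) = R(1, 5) + 0.8 \cdot \max\{Q(5, 1), Q(5, 4), Q(5, 5)\} = 100 + 0.8 \cdot 0 = 100
\]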
Q-Learning with Example
 Another example: let us start from the initial state, state 3.
 From room 3 we can go to room 1, 2, or 4; let's take room 1.
 From room 1, calculate the maximum Q value:
[Figure: state diagram showing the moves available from state 3 (to states 1, 2, and 4) and, after moving to state 1, the possible actions from state 1 (to states 3 and 5).]
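Working this out in the same way, and assuming the value Q(1, 5) = 100 learned in the previous episode has been kept:

\[
Q(3, 1) = R(3, 1) + 0.8 \cdot \max\{Q(1, 3), Q(1, 5)\} = 0 + 0.8 \cdot 100 = 80
\]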
Q-Learning with Example
 The formula for calculating the Q matrix, given earlier, depends on the Gamma parameter:
 The Gamma parameter has a range of 0 to 1 (0 <= Gamma < 1).
 If Gamma is closer to zero, the agent will tend to consider only immediate rewards (exploitation).
 If Gamma is closer to one, the agent will give future rewards greater weight (exploration).
Q-Learning with Example
 The Q-learning algorithm (a Python sketch of these steps follows the list):
 Set the Gamma parameter and the environment rewards in the reward matrix (R),
 Initialize matrix Q to zero,
 Select a random initial state and set initial state = current state,
 Select one among all possible actions for the current state,
 Using this possible action, consider going to the next state,
 Get the maximum Q value for this next state based on all possible actions,
 Compute Q(state, action) using the formula above,
 Repeat the above steps until current state = goal state.
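Below is a minimal Python sketch of these steps for the room example. It assumes numpy, an arbitrary number of training episodes, and the reward matrix reconstructed earlier (with the room-0 row taken from the standard version of this example); it is an illustration of the algorithm above, not the lecture's own code.

```python
import random
import numpy as np

GAMMA = 0.8      # discount factor used in the example
GOAL = 5         # room 5 (outside) is the goal state

# Step 1: reward matrix R (-1 = no door, 0 = door, 100 = door leading to the goal).
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])

# Step 2: initialize the Q matrix to zero.
Q = np.zeros_like(R, dtype=float)

for episode in range(1000):
    state = random.randrange(6)                      # Step 3: random initial state
    while state != GOAL:                             # Step 8: repeat until the goal is reached
        actions = np.where(R[state] >= 0)[0]         # Step 4: possible actions (existing doors)
        action = int(random.choice(actions))         # pick one of them at random
        next_state = action                          # Step 5: the action moves the agent to that room
        max_next = Q[next_state].max()               # Step 6: max Q of the next state over its actions
        Q[state, action] = R[state, action] + GAMMA * max_next   # Step 7: apply the formula
        state = next_state

print((Q / Q.max() * 100).round())                   # normalized Q matrix after training
```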
Question & Answer
Thank You !!!