2. Reinforcement Learning
How can an agent learn behaviors when it doesn’t have a
supervisor to tell it how to perform?
This problem is called reinforcement learning.
Rewards:
Positive / Negative
3. Reinforcement Learning (cont.)
The goal is to get the agent to act in the world so as
to maximize its rewards.
The agent has to figure out which of its actions earned it the
reward or punishment
5. Q-Learning
An agent can derive an optimal policy directly from its environment,
without needing to build a model beforehand.
Q-learning is a model-free learning technique that can be
used to find the optimal action-selection policy using a Q-function.
The Q-function gives the largest expected return achievable by
any policy π for each possible state-action pair.
6. Reinforcement Learning
In reinforcement learning we want to obtain a function Q(s, a) that predicts the
best action a in a state s so as to maximize the cumulative reward.
cumulative reward 1 = Q(s1, a1) + Q(s2, a1) = -1 + 0 = -1
cumulative reward 2 = Q(s1, a2) + Q(s2, a2) = 1 + 0.5 = 1.5
cumulative reward 3 = Q(s1, a2) + Q(s2, a1) = 1 + 0 = 1
cumulative reward 4 = Q(s1, a1) + Q(s2, a2) = -1 + 0.5 = -0.5
maximum cumulative reward = max(-1, 1.5, 1, -0.5) = 1.5
        a1     a2
s1      -1     1
s2       0     0.5
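The enumeration above can be reproduced in a few lines of Python; the Q-values are taken from the table, and every combination of one action in s1 and one in s2 is scored:

```python
from itertools import product

# Q-values from the table above
Q = {("s1", "a1"): -1.0, ("s1", "a2"): 1.0,
     ("s2", "a1"): 0.0,  ("s2", "a2"): 0.5}

# Enumerate every choice of action in s1 and s2 and sum the Q-values
rewards = {(a1, a2): Q[("s1", a1)] + Q[("s2", a2)]
           for a1, a2 in product(["a1", "a2"], repeat=2)}

best = max(rewards, key=rewards.get)
print(best, rewards[best])  # ('a2', 'a2') 1.5
```

Choosing a2 in both states gives the maximum cumulative reward of 1.5, matching cumulative reward 2 above.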
7. How Does Q-Learning Work?
Initialize Q
Choose an action from Q
Take the action
Observe the reward
Update Q
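The steps above can be sketched as a loop. This is a minimal sketch, assuming a made-up two-state toy environment; the `step` function and the hyperparameters alpha, gamma, and epsilon are illustrative, not from the slides:

```python
import random

random.seed(0)  # for reproducibility

# Toy deterministic environment (illustrative): two states, two actions.
# step(s, a) returns (next_state, reward); state 1 is terminal.
def step(state, action):
    if state == 0:
        return (1, 1.0) if action == 1 else (0, -1.0)
    return (1, 0.0)

alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = [[0.0, 0.0], [0.0, 0.0]]                   # Initialize Q

for episode in range(200):
    s = 0
    while s != 1:
        # Choose an action from Q (epsilon-greedy)
        a = random.randrange(2) if random.random() < epsilon \
            else max(range(2), key=lambda x: Q[s][x])
        s2, r = step(s, a)                      # Take the action, observe the reward
        # Update Q with the standard Q-learning rule
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print(Q[0])  # Q(s0, a1) should end up well above Q(s0, a0)
```

After training, the action that leads to the terminal reward has the higher Q-value, so the greedy policy picks it.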
10. Reactive Agent Algorithm Using
Reinforcement Learning
Repeat:
  s ← sensed state
  If s is terminal then exit
  a ← P(s)  /* choose action using policy */
  Perform a
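The loop above translates almost line for line into code. This sketch assumes a hypothetical policy table P and caller-supplied sense/act functions, none of which come from the slides:

```python
# Hypothetical policy table P: state -> action (e.g. learned by Q-learning)
P = {"cold": "heat_on", "warm": "heat_off"}

def run_reactive_agent(sense, act, is_terminal):
    """Repeat: sense the state; exit if terminal; choose a = P(s); perform a."""
    while True:
        s = sense()                 # s <- sensed state
        if is_terminal(s):          # if s is terminal then exit
            return
        a = P[s]                    # a <- P(s), choose action using policy
        act(a)                      # perform a

# Tiny demo with a scripted sensor
readings = iter(["cold", "warm", "done"])
log = []
run_reactive_agent(sense=lambda: next(readings),
                   act=log.append,
                   is_terminal=lambda s: s == "done")
print(log)  # ['heat_on', 'heat_off']
```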
11. Approaches
Learn the policy directly – a function mapping from states to actions
Q(S, A)
where S = {s1, s2, s3, s4} and A = {a1, a2, a3}
Learn utility values for states (i.e., the value function)
If the outcome of performing an action at a state is deterministic, then the
agent can update the utility value U() of states:
◦ U(new state) = reward + U(old state)
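The slide's deterministic update rule can be applied forward along a path of transitions. The states and reward values below are made up for illustration:

```python
# Illustrative: accumulate utilities forward along a deterministic path,
# using the slide's rule U(new state) = reward + U(old state).
U = {"s1": 0.0}                                  # utility of the start state
path = [("s1", "s2", 1.0), ("s2", "s3", 0.5)]    # (old, new, reward), made-up values
for old, new, reward in path:
    U[new] = reward + U[old]
print(U)  # {'s1': 0.0, 's2': 1.0, 's3': 1.5}
```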
12. Exploration / Exploitation Policy
Wacky approach (exploration): act randomly in the
hope of eventually exploring the entire environment
Greedy approach (exploitation): act to maximize
utility using the current estimate
Reasonable balance: act more wacky (exploratory)
when the agent has little idea of the environment, and more
greedy when its model is close to correct.
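One common way to implement this balance is epsilon-greedy action selection with a decaying epsilon; the decay rate and floor below are illustrative choices, not from the slides:

```python
import random

def choose_action(q_values, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))               # "wacky": random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # greedy: best estimate

# Decay epsilon so the agent explores early and exploits later
epsilon = 1.0
for episode in range(1000):
    a = choose_action([0.2, 0.8, -0.1], epsilon)
    epsilon = max(0.05, epsilon * 0.995)        # decay toward a small floor
```

With epsilon = 0 the agent is purely greedy; with epsilon = 1 it is purely exploratory; the decay schedule moves it from one regime to the other as its estimates improve.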
13. Summary
Reinforcement learning is an active area of research.
It is applicable to game playing, robot control, and other domains.