16. Expected future rewards
Any goal can be represented as the expected sum of discounted intermediate rewards.
E\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t} \mid S_t\right] = E\left[R_0 + \gamma R_1 + \gamma^{2} R_2 + \dots \mid S_t\right]
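A minimal sketch of this quantity in Python (the rewards and discount factor below are made-up values, not from the slides):

gamma = 0.99
rewards = [1.0, 0.0, 0.5, 2.0]  # hypothetical rewards R_0, R_1, R_2, R_3

# Discounted return: R_0 + gamma*R_1 + gamma^2*R_2 + ...
expected_return = sum(gamma ** t * r for t, r in enumerate(rewards))
print(expected_return)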
17. Tools
1. Policy: π(a|s)
2. Value function: Q(s, a)
3. Model: (P, R)
We have to pick at least one of the three.
18. Policy
A policy defines how the agent behaves.
It takes a state as input and outputs an action.
It can be stochastic or deterministic.
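As an illustrative sketch (the tabular Q and softmax sampling are assumptions, not from the slides), a deterministic policy always returns the same action for a state, while a stochastic policy samples one:

import numpy as np

def deterministic_policy(state, Q):
    # always pick the highest-valued action for this state
    return int(np.argmax(Q[state]))

def stochastic_policy(state, Q, temperature=1.0):
    # sample an action with probability given by a softmax over the Q-values
    prefs = np.exp(Q[state] / temperature)
    probs = prefs / prefs.sum()
    return int(np.random.choice(len(probs), p=probs))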
19. Value function
A value function estimates how much reward the agent can achieve.
It takes a state as input and outputs values, one for each possible action.
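For a small discrete problem the value function can simply be a table; the sizes below are made up for illustration:

import numpy as np

n_states, n_actions = 10, 4          # hypothetical sizes
Q = np.zeros((n_states, n_actions))  # one value per (state, action) pair

state = 3
print(Q[state])     # the value of every possible action in this state
print(Q[state, 2])  # the value of action 2 in state 3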
20. Model
A model is the agent's representation of the environment.
It takes a state and an action as input and outputs (next_state, reward).
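A tabular sketch of a model, assuming a tiny made-up environment; a learned model would replace the dictionary with a function approximator:

# Hypothetical model: (state, action) -> (next_state, reward)
model = {
    (0, 0): (1, 0.0),
    (1, 1): (2, 1.0),
}

def predict(state, action):
    return model[(state, action)]

next_state, reward = predict(0, 0)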
24. Repeat
1. Prediction: Compute the value of the expected reward from s_t until the terminal state.
2. Control: Act greedily with respect to the predicted values.
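One iteration of this loop might look as follows; the tabular Q and the gym-like env.step interface are assumptions for the sketch:

import numpy as np

def act_greedily(Q, state, env):
    # prediction: Q[state] holds the estimated return from s_t for each action
    action = int(np.argmax(Q[state]))            # control: act greedily w.r.t. the predicted values
    next_state, reward, done = env.step(action)  # hypothetical gym-like step
    return next_state, reward, done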
28. Update rule
In rabbits, humans and machines we get the same algorithm:
while True:
    Q[t] = Q[t-1] + alpha * (Q_target - Q[t-1])
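A runnable version of the same update with made-up values, showing the estimate moving toward the target as an exponential moving average:

Q_target = 1.0   # assumed fixed target for illustration
alpha = 0.1
Q = 0.0

for _ in range(50):
    Q = Q + alpha * (Q_target - Q)
print(Q)  # approaches 1.0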
29. Q-Learning [Watkins, 1989]
The agent does not have a model of the environment.
It performs actions following a behavior policy.
It predicts using the target policy.
This makes it an "off-policy", model-free method.
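A sketch of one tabular Q-learning step, assuming an epsilon-greedy behavior policy and a gym-like environment (hyperparameter values are illustrative):

import numpy as np

def q_learning_step(Q, state, env, alpha=0.1, gamma=0.99, epsilon=0.1):
    # behavior policy: epsilon-greedy over the current estimates
    if np.random.rand() < epsilon:
        action = np.random.randint(Q.shape[1])
    else:
        action = int(np.argmax(Q[state]))
    next_state, reward, done = env.step(action)
    # target policy: greedy, i.e. max over next actions, regardless of what will actually be taken
    target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
    Q[state, action] += alpha * (target - Q[state, action])
    return next_state, done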
30. Loss function
Building on what we learned from the rabbit.
The learning goal is to minimize the following loss function.
Putting it all together, we get:
Q_target = r + gamma * np.max(Q(s_next, A))
Loss = 1/n * np.sum((Q_target - Q(s, a))**2)
35. DeepMind ideas
1. Different neural networks for Q and Q_target
2. Estimate Q_target using past experiences
3. Update Q_target every C steps
4. Clip rewards between -1 and 1
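A rough sketch of how these ideas can appear in code; the buffer size, sync period C, and weight representation are assumptions:

import numpy as np
from collections import deque

replay_buffer = deque(maxlen=100_000)       # idea 2: store past experiences to estimate Q_target

def clip_reward(r):
    return float(np.clip(r, -1.0, 1.0))     # idea 4: clip rewards to [-1, 1]

def maybe_sync(step, online_weights, target_weights, C=1000):
    # ideas 1 & 3: keep a separate target network, copying the online weights into it every C steps
    if step % C == 0:
        return [w.copy() for w in online_weights]
    return target_weights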
36. Network
Input: an image of shape [None, 42, 42, 4]
4 Conv2D layers, 32 filters, 4x4 kernel
1 Hidden layer of size 256
1 Fully connected layer of size action_size
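A possible Keras sketch of this architecture, reading the slide as four Conv2D layers of 32 filters each; strides, activations, and the action_size value are assumptions:

from tensorflow import keras
from tensorflow.keras import layers

action_size = 4  # assumption: depends on the environment

model = keras.Sequential([
    layers.Conv2D(32, 4, activation="relu", input_shape=(42, 42, 4)),  # first of 4 conv layers
    layers.Conv2D(32, 4, activation="relu"),
    layers.Conv2D(32, 4, activation="relu"),
    layers.Conv2D(32, 4, activation="relu"),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),  # hidden layer of size 256
    layers.Dense(action_size),             # one Q-value per action
])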