3. DQN (Deep Q Network)
• Intuition: use a deep-learning function approximator with weights 𝜃 as a Q network.
• Apply Q-updates on batches of past experience instead of online:
• Experience Replay
• More data-efficient
• Decorrelates samples and makes the data distribution more stationary
• Use an older set of weights to compute the targets (target
network):
• Keeps the target function from changing too quickly.
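As a concrete illustration of the two ideas above, here is a minimal sketch of a DQN update with an experience-replay buffer and a periodically synced target network, assuming a small PyTorch network; the layer sizes, hyper-parameters, and variable names are illustrative, not taken from the slides.

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Hypothetical Q-network: maps a 4-dim state to one Q-value per action.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())   # target starts as a copy

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)   # experience replay buffer of (s, a, r, s2, done)
gamma = 0.99

def dqn_update(batch_size=32):
    """One Q-update on a batch of past experience (not online)."""
    batch = random.sample(replay, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    s  = torch.tensor(states, dtype=torch.float32)
    a  = torch.tensor(actions, dtype=torch.int64)
    r  = torch.tensor(rewards, dtype=torch.float32)
    s2 = torch.tensor(next_states, dtype=torch.float32)
    d  = torch.tensor(dones, dtype=torch.float32)

    # Targets are computed with the older target network, so they do not
    # change on every gradient step.
    with torch.no_grad():
        target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values

    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically (e.g. every few thousand steps) sync the target network:
# target_net.load_state_dict(q_net.state_dict())
```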
6. Double DQN
• DQNs suffer from non-uniform overestimation of action values, which leads to inefficient function approximation and noisy estimates.
• DDQN evaluates the greedy policy according to the online network, but uses the target network to estimate its value.
• This systematic overestimation introduces a maximization bias in learning, and since Q-learning involves bootstrapping, the bias is propagated through successive updates.
• We use the online network (weights 𝜃) for action selection and the target network (weights 𝜃⁻) for action evaluation:
Y_t = R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; 𝜃); 𝜃⁻)
instead of
Y_t = R_{t+1} + γ max_a Q(S_{t+1}, a; 𝜃⁻)
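A hedged sketch of this difference at the level of the target computation, reusing the illustrative q_net / target_net and batch tensors from the DQN sketch above:

```python
import torch

def dqn_target(r, s2, d, gamma=0.99):
    # Vanilla DQN: the target network both selects and evaluates the action,
    # so any overestimation is propagated into the target.
    with torch.no_grad():
        return r + gamma * (1 - d) * target_net(s2).max(dim=1).values

def double_dqn_target(r, s2, d, gamma=0.99):
    # Double DQN: the online network selects the greedy action,
    # the target network evaluates it.
    with torch.no_grad():
        best = q_net(s2).argmax(dim=1, keepdim=True)                  # selection
        return r + gamma * (1 - d) * target_net(s2).gather(1, best).squeeze(1)  # evaluation
```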
7. Advantage Function
• Advantage value: it defines how much better it is to take a specific action compared to the average action at the given state.
• Decompose Q into a state value and an advantage: Q(s,a) = V(s) + A(s,a), i.e. A(s,a) = Q(s,a) − V(s).
• The evaluation of an action is then based not only on how good the action is in absolute terms, but on how much better it is than the alternatives.
• The advantage function gives a relative measure of the importance of each action.
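A tiny numeric sketch of this decomposition; the Q-values are made up, and V(s) is taken as the mean of the Q-values (i.e. the value under a uniform policy):

```python
import numpy as np

# Hypothetical Q-values for one state with three actions.
q = np.array([1.0, 2.5, 0.5])

# Under a uniform policy, the state value is the mean of the Q-values.
v = q.mean()

# The advantage is the relative measure: positive means better than average.
a = q - v
print(v)   # 1.333...
print(a)   # [-0.333  1.167 -0.833] (approximately)
```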
8. Dueling Network
• Dueling Network Architecture: the dueling network is a single Q network with two streams, one for the state-value function and another for the advantage function.
• The dueling architecture can learn which states are or aren't valuable without having to learn the effect of each action in each state. This is particularly useful in states where the actions do not affect the environment in any relevant way.
9. Dueling Network (cont.)
One stream of fully-connected layers outputs a scalar V(s;𝜃,β), and the other stream outputs an |A|-dimensional vector A(s,a;𝜃,α). Here, 𝜃 denotes the parameters of the convolutional layers, while α and β are the parameters of the two streams of fully-connected layers.
The estimates V(s;𝜃,β) and A(s,a;𝜃,α) are computed automatically, without any extra supervision or algorithmic modifications.
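A minimal sketch of the dueling forward pass in PyTorch, combining the two streams as Q(s,a) = V(s) + (A(s,a) − mean_a′ A(s,a′)) so that V and A are identifiable; layer sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value_stream = nn.Linear(hidden, 1)         # V(s; 𝜃, β)
        self.adv_stream = nn.Linear(hidden, n_actions)   # A(s, a; 𝜃, α)

    def forward(self, s):
        h = self.trunk(s)
        v = self.value_stream(h)          # shape (batch, 1)
        a = self.adv_stream(h)            # shape (batch, |A|)
        # Subtract the mean advantage so that V and A are identifiable.
        return v + a - a.mean(dim=1, keepdim=True)

q = DuelingQNet()(torch.randn(8, 4))      # -> Q-values of shape (8, 2)
```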
10. Prioritized Replay
• The basic intuition is to assign priorities to the samples drawn from the replay buffer, instead of sampling uniformly at random
• Priorities are based on the TD loss: the higher the loss, the higher the priority
• This ensures more efficient use of the data
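A hedged numpy sketch of proportional prioritization; the exponents alpha and beta and the epsilon offset follow the usual prioritized-replay formulation, and the concrete values are illustrative:

```python
import numpy as np

def sample_prioritized(td_errors, batch_size=4, alpha=0.6, beta=0.4, eps=1e-6):
    # Priority is derived from the magnitude of the TD error/loss.
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()

    # Sample indices proportionally to their priority.
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)

    # Importance-sampling weights correct the bias introduced by
    # non-uniform sampling.
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights

idx, w = sample_prioritized(np.array([0.1, 2.0, 0.5, 0.05, 1.2]))
```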
11. Multistep Learning
• Multi-step targets with a suitably tuned n often lead to faster learning.
• The truncated n-step return from a given state S_t is
R_t^(n) = Σ_{k=0}^{n−1} γ^k R_{t+k+1}
• Loss function: the squared TD error with the multi-step target,
(R_t^(n) + γ^n max_a Q(S_{t+n}, a; 𝜃⁻) − Q(S_t, A_t; 𝜃))²
• Inspired by A3C.
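A small sketch of the truncated n-step return, assuming a plain Python list of the rewards observed from step t onwards; names and values are illustrative:

```python
def n_step_return(rewards, n=3, gamma=0.99):
    """Truncated n-step return: sum of the first n discounted rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[:n]))

# Example: rewards observed after state S_t
print(n_step_return([1.0, 0.0, 2.0, 5.0], n=3))   # 1.0 + 0 + 0.99**2 * 2.0
```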
14. NoisyNets
• Noise is sampled for the parameters of the (noisy) fully-connected layers at every iteration
• The level of noise required in different areas of the state space is tuned automatically, unlike the fixed exploration schedule of the epsilon-greedy method.
• Each noisy parameter has a learnable mean and a learnable noise scale σ, with σ initialized to 0.017.
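A minimal sketch of a noisy linear layer with independent Gaussian noise (a simplification of the factorized-noise variant Rainbow actually uses); the 0.017 initialization of σ follows the value quoted above:

```python
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer whose parameters are mu + sigma * eps, eps ~ N(0, 1)."""
    def __init__(self, in_f, out_f, sigma0=0.017):
        super().__init__()
        bound = 1 / math.sqrt(in_f)
        self.w_mu = nn.Parameter(torch.empty(out_f, in_f).uniform_(-bound, bound))
        self.b_mu = nn.Parameter(torch.empty(out_f).uniform_(-bound, bound))
        # Learnable noise scale, initialized to a small constant.
        self.w_sigma = nn.Parameter(torch.full((out_f, in_f), sigma0))
        self.b_sigma = nn.Parameter(torch.full((out_f,), sigma0))

    def forward(self, x):
        # Fresh noise is drawn on every forward pass, so exploration is
        # driven by the learned sigma instead of an epsilon schedule.
        w = self.w_mu + self.w_sigma * torch.randn_like(self.w_sigma)
        b = self.b_mu + self.b_sigma * torch.randn_like(self.b_sigma)
        return nn.functional.linear(x, w, b)

out = NoisyLinear(4, 2)(torch.randn(8, 4))   # -> shape (8, 2)
```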
15. Distributional RL
Question to ask: why approximate only the expectation of the return for each action, and not the full return distribution?
The return distribution is represented on a fixed support of atoms. Applying the Bellman update shifts and scales the support of the target distribution; this target is projected back onto the fixed support, and the KL divergence between the projected target distribution and the predicted distribution is minimized.
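A short sketch of the categorical representation this refers to: a fixed support of atoms, Q-values recovered as expectations over that support, and a KL (cross-entropy) loss against a projected target distribution; the projection step itself is omitted and all names and sizes are illustrative:

```python
import torch

n_atoms, v_min, v_max = 51, -10.0, 10.0
support = torch.linspace(v_min, v_max, n_atoms)     # fixed atoms z_i

# Hypothetical predicted distribution over atoms for a batch of state-action pairs.
logits = torch.randn(8, n_atoms)
probs = logits.softmax(dim=1)

# The scalar Q-value is the expectation of the distribution.
q_values = (probs * support).sum(dim=1)

# Training minimizes the KL divergence between the projected target
# distribution m and the prediction, i.e. a cross-entropy loss.
m = torch.softmax(torch.randn(8, n_atoms), dim=1)   # stand-in for the projected target
loss = -(m * probs.clamp(min=1e-8).log()).sum(dim=1).mean()
```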
16. Rainbow Implementation
Rainbow replaces the 1-step distributional loss with a multi-step variant. So the target distribution looks like
d_t^(n) = (R_t^(n) + γ^n z, p_𝜃⁻(S_{t+n}, a*_{t+n}))
instead of
d_t^(1) = (R_{t+1} + γ z, p_𝜃⁻(S_{t+1}, a*_{t+1}))
Hence the Kullback–Leibler loss with double Q-learning (the online network selects the greedy action a*, the target network provides its distribution) looks like
D_KL(Φ_z d_t^(n) ‖ d_t)
instead of
D_KL(Φ_z d_t^(1) ‖ d_t)
where Φ_z is the projection of the target onto the fixed support z.
Prioritized replay uses this KL loss as the transition priority (rather than the absolute TD error).
On top of this, the dueling network architecture is used, with all linear layers replaced by their noisy equivalents.
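A hedged sketch of how these pieces meet at the loss level, assuming the projected multi-step target distribution is already computed; names are illustrative:

```python
import torch

def rainbow_loss(pred_probs, projected_target, is_weights):
    """Per-sample KL (cross-entropy) loss, weighted by importance sampling.

    pred_probs:       predicted distribution p_theta(S_t, A_t) over the atoms
    projected_target: multi-step target distribution projected onto the support
    is_weights:       importance-sampling weights from prioritized replay
    """
    kl = -(projected_target * pred_probs.clamp(min=1e-8).log()).sum(dim=1)
    # The same per-sample KL values are fed back as the new priorities.
    new_priorities = kl.detach()
    return (is_weights * kl).mean(), new_priorities
```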
17. Ablation Results
• Prioritized Replay and Multi-Step Learning are the two most crucial
components of Rainbow
• Next in rank is Distributional Q-Learning
• Noisy Nets had a positive effect on overall performance, though their removal improved performance on some games and reduced it on others
• Dueling did not bring a significant change to Rainbow
• Double Q-Learning also did not show much change; its effect was harmful in some cases and helpful in others, depending on the game dynamics