Presentation of the DeepMind paper "FeUdal Networks for Hierarchical Reinforcement Learning" at the SF Reading Group.
Join Slack https://xixslack.herokuapp.com/ to discuss and Meetup https://www.meetup.com/superintelligencemeetup/ to participate.
2. Motivation
● Deep Reinforcement Learning works really well when rewards occur often
● Environments with long-term credit assignment and sparse rewards are still a challenge
● Non-Markovian environments that require memory are particularly challenging
● Non-hierarchical models often overfit to a specific input-output mapping.
3. Feudal Reinforcement Learning
● A managerial hierarchy observes the world at different resolutions [Information Hiding]
● Managers communicate goals to their “workers” and reward them for meeting those goals [Reward Hiding]
Dayan & Hinton, 1993
4. Contributions
● FuN: an end-to-end differentiable model that implements the principles of Feudal RL [Dayan & Hinton, 1993]
● A novel, approximate transition policy gradient update for training the Manager
● Use of goals that are directional rather than absolute in nature
● A novel dilated LSTM that extends the longevity of the Manager’s memory
7. Goal embedding
● The Worker produces an embedding for each action - the matrix U
● The last c goals from the Manager are summed and projected into a vector w ∈ R^k
● The Manager’s goal w modulates the policy via a multiplicative interaction in the low, k-dimensional space (see the sketch below)
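A minimal sketch of this goal modulation, assuming PyTorch; num_actions, k, d, phi and worker_policy are illustrative names and sizes, not the paper's, and this only shows w = phi(sum of last c goals) and pi = SoftMax(U w):

# Rough sketch, not the authors' code; assumes PyTorch and arbitrary sizes.
import torch
import torch.nn as nn

num_actions, k, d = 6, 16, 256          # |actions|, goal-embedding dim k, latent state dim d (assumed)
phi = nn.Linear(d, k, bias=False)       # projection of pooled goals; no bias (see the notes below)

def worker_policy(U, recent_goals):
    # U: (batch, num_actions, k) per-action embeddings produced by the Worker
    # recent_goals: (batch, c, d) last c goals emitted by the Manager
    w = phi(recent_goals.sum(dim=1))                      # w in R^k
    logits = torch.bmm(U, w.unsqueeze(-1)).squeeze(-1)    # multiplicative interaction U w
    return torch.softmax(logits, dim=-1)                  # pi = SoftMax(U w)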
8. Training
● The Manager is trained to set goals in an advantageous direction in state space (see the update below)
● The Worker is trained with an intrinsic reward for following the Manager’s goals (see the reward below)
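For reference, the corresponding formulas from the paper (A^M_t is the Manager's advantage, d_cos is cosine similarity):

\nabla g_t = A^M_t \,\nabla_\theta\, d_{\cos}\big(s_{t+c} - s_t,\; g_t(\theta)\big), \qquad A^M_t = R_t - V^M_t(x_t, \theta)

r^I_t = \frac{1}{c}\sum_{i=1}^{c} d_{\cos}\big(s_t - s_{t-i},\; g_{t-i}\big), \qquad d_{\cos}(\alpha, \beta) = \frac{\alpha^\top \beta}{|\alpha|\,|\beta|}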
9. Training
● The Worker is trained in an Actor-Critic setup, using a weighted sum of the intrinsic reward and the environment reward in the advantage function (see the update below):
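The Worker's policy-gradient update from the paper:

\nabla \pi_t = A^D_t \,\nabla_\theta \log \pi(a_t \mid x_t; \theta), \qquad A^D_t = R_t + \alpha R^I_t - V^D_t(x_t; \theta)

where \alpha weights the intrinsic return R^I_t against the environment return R_t.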
10. Transition Policy Gradients
● The Manager can be trained as if it had a high-level policy that selects sub-policies o_t
● This high-level policy can be composed with the transition distribution to give a “transition policy”, to which the policy gradient can be applied (see the update below):
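The resulting update, as given in the paper:

\nabla_\theta \pi^{TP}_t = \mathbb{E}\big[(R_t - V(x_t))\, \nabla_\theta \log p(s_{t+c} \mid s_t, \theta)\big]

Assuming the direction of the state transition, s_{t+c} - s_t, follows a von Mises-Fisher distribution centred on the goal g_t, this reduces to the Manager's update from slide 8.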
11. Dilated LSTM
● Given a dilation radius r, the network’s full state h is a combination of r sub-states or “cores”, {h^i}_{i=1}^r
● At time t the LSTM only uses and updates the (t % r)-th core, h^{t%r}_{t-1}, while sharing parameters across cores
● The output is pooled across the previous c outputs
● This preserves memories for long periods while still processing every input and updating the output at every step (see the sketch after this list)
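A minimal sketch of the dilated LSTM idea, assuming PyTorch; DilatedLSTM and its sizes are illustrative, the r cores share one LSTMCell's parameters, and pooling here is a mean over the cores' hidden states, which only approximates the paper's pooling over the previous c outputs:

import torch
import torch.nn as nn

class DilatedLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, r=10):
        super().__init__()
        self.r, self.hidden_size = r, hidden_size
        self.cell = nn.LSTMCell(input_size, hidden_size)   # parameters shared across cores

    def init_state(self, batch):
        # r separate (h, c) sub-states, one per core
        return [(torch.zeros(batch, self.hidden_size), torch.zeros(batch, self.hidden_size))
                for _ in range(self.r)]

    def forward(self, x, state, t):
        i = t % self.r                        # only this core reads and writes its sub-state now
        h, c = self.cell(x, state[i])
        state[i] = (h, c)
        out = torch.stack([h_i for h_i, _ in state]).mean(dim=0)   # pooled output at every step
        return out, state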
It is symptomatic that the standard approach on the ATARI benchmark suite (Bellemare et al., 2012) is to use an action-repeat heuristic, where each action translates into several (usually 4) consecutive actions.
Having no biases ensures there is no way to produce a constant non-zero goal vector.
Due to pooling, the conditioning from the Manager varies smoothly.
We use directions because it is more feasible for the Worker to reliably cause directional shifts in the latent state than to reach (potentially) arbitrary new absolute locations.
Learning curve on Montezuma’s Revenge
This is a visualisation of sub-goals learnt by FuN in the first room. Tall bars: the number of states for which the current state maximised cos(s - s_t, g_t).
Visualisation of sub-policies learnt on the Seaquest game.
Ablative analysis:
Non-feudal FuN: trained with a policy gradient whose gradient flows into g from the Worker, and with no intrinsic reward.
The Manager’s g trained via a standard policy gradient.
g is an absolute goal instead of a direction.
Pure feudal: the Worker receives only the intrinsic reward.
Testing the separation between the Worker and the Manager:
Initialise from an agent that was trained with action repeat = 4, then run on an environment without action repeat. Increase the dilation by 4 and the Manager’s horizon c by 4. Train for 200 episodes.