Presentation of the DeepMind paper "FeUdal Networks for Hierarchical Reinforcement Learning" at the SF Reading Group.
Join Slack https://xixslack.herokuapp.com/ to discuss and Meetup https://www.meetup.com/superintelligencemeetup/ to participate.
2. Motivation
● Deep Reinforcement Learning works really well when rewards occur often
● Environments with long-term credit assignment and sparse rewards are still a challenge
● Non-Markovian environments that require memory are particularly challenging
● Non-hierarchical models often overfit to a specific input-output mapping.
3. Feudal Reinforcement Learning
● A managerial hierarchy observes the world at different resolutions [Information Hiding]
● Managers communicate goals to their “workers” and reward them for meeting those goals [Reward Hiding]
Dayan & Hinton, 1993
4. Contributions
● FuN: an end-to-end differentiable model that implements the principles of Feudal RL [Dayan & Hinton, 1993]
● A novel, approximate transition policy gradient update for training the Manager
● Use of goals that are directional rather than absolute in nature
● A novel dilated LSTM that extends the longevity of the Manager’s memory
7. Goal embedding
● The Worker produces an embedding for each action - the matrix U
● The last c goals from the Manager are summed and projected into a vector w ∈ R^k
● The Manager’s goal w modulates the policy via a multiplicative interaction in the low, k-dimensional space (see the sketch below)
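A minimal sketch of this goal modulation, assuming PyTorch; num_actions, k, d, phi and worker_policy are illustrative names and sizes, not the paper's, and this only shows w = phi(sum of last c goals) and pi = SoftMax(U w):

# Rough sketch, not the authors' code; assumes PyTorch and arbitrary sizes.
import torch
import torch.nn as nn

num_actions, k, d = 6, 16, 256          # |actions|, goal-embedding dim k, latent state dim d (assumed)
phi = nn.Linear(d, k, bias=False)       # projection of pooled goals; no bias (see the notes below)

def worker_policy(U, recent_goals):
    # U: (batch, num_actions, k) per-action embeddings produced by the Worker
    # recent_goals: (batch, c, d) last c goals emitted by the Manager
    w = phi(recent_goals.sum(dim=1))                      # w in R^k
    logits = torch.bmm(U, w.unsqueeze(-1)).squeeze(-1)    # multiplicative interaction U w
    return torch.softmax(logits, dim=-1)                  # pi = SoftMax(U w)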
8. Training
● The Manager is trained to set goals in an advantageous direction in state space (see the update below)
● The Worker is trained with an intrinsic reward for following the Manager’s goals (see the reward below)
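For reference, the corresponding formulas from the paper (A^M_t is the Manager's advantage, d_cos is cosine similarity):

\nabla g_t = A^M_t \,\nabla_\theta\, d_{\cos}\big(s_{t+c} - s_t,\; g_t(\theta)\big), \qquad A^M_t = R_t - V^M_t(x_t, \theta)

r^I_t = \frac{1}{c}\sum_{i=1}^{c} d_{\cos}\big(s_t - s_{t-i},\; g_{t-i}\big), \qquad d_{\cos}(\alpha, \beta) = \frac{\alpha^\top \beta}{|\alpha|\,|\beta|}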
9. Training
● The Worker is trained in an Actor-Critic setup, using a weighted sum of the intrinsic reward and the environment reward in the advantage function (see the update below):
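The Worker's policy-gradient update from the paper:

\nabla \pi_t = A^D_t \,\nabla_\theta \log \pi(a_t \mid x_t; \theta), \qquad A^D_t = R_t + \alpha R^I_t - V^D_t(x_t; \theta)

where \alpha weights the intrinsic return R^I_t against the environment return R_t.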
10. Transition Policy Gradients
● The Manager can be trained as if it had a high-level policy that selects sub-policies o_t
● This high-level policy can be composed with the transition distribution to give a “transition policy”, to which the policy gradient can be applied (see the update below):
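The resulting update, as given in the paper:

\nabla_\theta \pi^{TP}_t = \mathbb{E}\big[(R_t - V(x_t))\, \nabla_\theta \log p(s_{t+c} \mid s_t, \theta)\big]

Assuming the direction of the state transition, s_{t+c} - s_t, follows a von Mises-Fisher distribution centred on the goal g_t, this reduces to the Manager's update from slide 8.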
11. Dilated LSTM
● Given a dilation radius r, the network’s full state h is a combination of r sub-states or “cores”, {h^i}_{i=1}^r
● At time t the LSTM only uses and updates the (t % r)-th core, h^{t%r}_{t-1}, while sharing parameters across cores
● The output is pooled across the previous c outputs
● This preserves memories for long periods while still processing every input and updating the output at every step (see the sketch after this list)
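A minimal sketch of the dilated LSTM idea, assuming PyTorch; DilatedLSTM and its sizes are illustrative, the r cores share one LSTMCell's parameters, and pooling here is a mean over the cores' hidden states, which only approximates the paper's pooling over the previous c outputs:

import torch
import torch.nn as nn

class DilatedLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, r=10):
        super().__init__()
        self.r, self.hidden_size = r, hidden_size
        self.cell = nn.LSTMCell(input_size, hidden_size)   # parameters shared across cores

    def init_state(self, batch):
        # r separate (h, c) sub-states, one per core
        return [(torch.zeros(batch, self.hidden_size), torch.zeros(batch, self.hidden_size))
                for _ in range(self.r)]

    def forward(self, x, state, t):
        i = t % self.r                        # only this core reads and writes its sub-state now
        h, c = self.cell(x, state[i])
        state[i] = (h, c)
        out = torch.stack([h_i for h_i, _ in state]).mean(dim=0)   # pooled output at every step
        return out, state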
It is symptomatic that the standard approach on the ATARI benchmark suite (Bellemare et al., 2012) is to use an action-repeat heuristic, where each action translates into several (usually 4) consecutive actions.
Having no biases ensures there is no way to produce a constant non-zero goal vector.
Due to pooling, the conditioning from the Manager varies smoothly.
We use directions because it is more feasible for the Worker to reliably cause directional shifts in the latent state than to reach (potentially) arbitrary new absolute locations.
Learning curve on Montezuma’s Revenge
This is a visualisation of sub-goals learnt by FuN in the first room. Tall bars: the number of states for which the current state maximised cos(s - s_t, g_t).
Visualisation of sub-policies learnt on the Seaquest game.
Ablative analysis:
Non-feudal FuN: trained with a policy gradient whose gradient flows into g from the Worker, and with no intrinsic reward.
The Manager’s g trained via a standard policy gradient.
g is an absolute goal instead of a direction.
Pure feudal: the Worker receives only the intrinsic reward.
Testing the separation between the Worker and the Manager:
Initialise from an agent that was trained with action repeat = 4, then run on an environment without action repeat. Increase the dilation by 4 and the Manager’s horizon c by 4. Train for 200 episodes.