Diversity is All You Need :
Learning Skills without a Reward Function
김예찬(Paul Kim)
Index
1. Abstract
2. Introduction
3. Related Work
4. Diversity is All You Need
4.1 How it Works
4.2 Implementation
5. What Skills are Learned?
6. Harnessing Learned Skills
6.1 Adapting Skills to Maximize Reward
6.2 Using Skills for Hierarchical RL
6.3 Imitating an Expert
7. Conclusion
Abstract
1. Abstract
DIAYN(Diversity is All You Need)
- Agents can explore their environment and learn useful skills without supervision
- DIAYN can learn useful skills without a reward function
- It works by maximizing an information-theoretic objective using a maximum entropy policy
- DIAYN is presented as an effective pretraining method, addressing RL's problems with exploration and data efficiency
Introduction
2. Introduction
DRL has been demonstrated to effectively learn a wide range of reward-driven skills, including:
1. playing games
2. controlling robots
3. navigation
2. Introduction
DRL has been demonstrated to effectively learn a wide range of reward-driven skills, including:
1. playing games
2. controlling robots
3. navigation
DIAYN, in contrast, is not reward driven.
2. Introduction
DIAYN : Unsupervised skill discovery
- Learning useful skills without supervision can aid exploration in sparse-reward tasks
- For long-horizon tasks, skills discovered without reward can serve as primitives for HRL, effectively shortening the episode length
- Reward design is one form of human feedback; with DIAYN there is no need to spend much time designing a reward function
2. Introduction
What is a Skill?
- A skill is a policy that changes the state of the environment in a consistent way
- Skills might be useless
- Skills should be not only distinguishable but also as diverse as possible
- Diverse skills are robust to perturbations and explore the environment better
2. Introduction
Key idea
Acquire skills that are both distinguishable and diverse
- objective based on mutual information
- applications : HRL, imitation learning
2. Introduction
Five contributions
1. A method for learning useful skills without any rewards
- maximizing an information-theoretic objective with a maximum entropy policy
2. This simple exploration objective results in the unsupervised emergence of diverse skills
- e.g., running and jumping; some of the learned skills solve the benchmark task
3. A simple method for using learned skills for HRL, and this method solves challenging tasks
4. How discovered skills can be quickly adapted to solve a new task
5. Discovered skills can be used for imitation learning
2. Introduction
Related Work
3. Related Work
HRL Perspective
Previous work
- HRL has learned skills to maximize a single, known reward function by jointly learning a set of skills and a meta-controller
- in joint training, the meta-policy does not select ‘bad’ options, so these options never receive any reward signal to improve
DIAYN
- uses a random meta-policy
- learns skills with no reward
3. Related Work
Connection between RL and information theory
Previous work
- mutual information between states and actions as a notion of empowerment for an intrinsically motivated agent
- a discriminability objective is equivalent to maximizing the mutual information between the latent skill z and some aspect of the corresponding trajectory
- settings with many tasks and a reward function
- settings with a single task reward
DIAYN
- maximizes the mutual information between states and skills (this can be interpreted as maximizing the empowerment of a hierarchical agent whose action space is the set of skills)
3. Related Work
Connection between RL and information theory
DIAYN (continued)
- uses maximum entropy policies to force skills to be diverse
- fixes the distribution p(z) rather than learning it, preventing p(z) from collapsing to sampling only a handful of skills
- the discriminator looks at every state, which provides additional reward signal
3. Related Work
Neuroevolution and evolutionary algorithms
- work on neuroevolution and evolutionary algorithms has studied how complex behaviors can be learned by directly maximizing diversity
DIAYN
- acquires complex skills with minimal supervision to improve efficiency
- focuses on deriving a general, information-theoretic objective that does not require manual design of distance metrics and can be applied to any RL task without additional engineering
3. Related Work
Intrinsic motivation
- previous works use an intrinsic motivation objective to learn a single policy
DIAYN
- proposes an objective for learning many, diverse policies
Diversity is
All You Need
4. Diversity is All You Need
Unsupervised RL paradigm
- the agent is allowed an unsupervised “exploration” stage followed by a supervised stage
- the aim of the unsupervised stage is to learn skills that eventually make it easier to maximize the task reward in the supervised stage
- conveniently, because skills are learned without a priori knowledge of the task, the learned skills can be used for many different tasks
Unsupervised stage
- the agent explores the environment but does not receive any task reward; its goal is to learn skills
Supervised stage
- the agent receives the task reward, and its goal is to learn the task by maximizing that reward
4.1 How it Works?
DIAYN rests on three ideas (the combined objective is written out below):
1. The skill dictates the states that the agent visits
- the skill should control which states the agent visits, so we maximize the mutual information between skills and states, MI(s, z)
2. To distinguish skills, we use states, not actions
- to ensure that states, not actions, are used to distinguish skills, we minimize the mutual information between skills and actions given the state, MI(a, z | s)
3. The skills should be as diverse as possible
- maximize the entropy of the mixture of policies (the collection of skills together with p(z))
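Putting the three ideas together, the DIAYN objective and its variational lower bound can be written roughly as follows (S, Z, A are state, skill, and action random variables; q_φ is the learned skill discriminator; p(z) is the fixed skill prior; r_z is the pseudo-reward used in Section 4.2):

```latex
% DIAYN objective: maximize MI(S;Z) and policy entropy, minimize MI(A;Z|S).
\begin{aligned}
\mathcal{F}(\theta) &\triangleq I(S;Z) + \mathcal{H}[A \mid S] - I(A;Z \mid S) \\
                    &= \mathcal{H}[Z] - \mathcal{H}[Z \mid S] + \mathcal{H}[A \mid S, Z] \\
% Replacing the intractable posterior p(z|s) with the discriminator q_phi(z|s)
% gives a variational lower bound G(theta, phi):
\mathcal{F}(\theta) &\geq \mathcal{H}[A \mid S, Z]
  + \mathbb{E}_{z \sim p(z),\; s \sim \pi(z)}\!\left[\log q_{\phi}(z \mid s) - \log p(z)\right]
  \triangleq \mathcal{G}(\theta, \phi)
\end{aligned}
% The per-step pseudo-reward optimized by the policy:
r_z(s, a) \triangleq \log q_{\phi}(z \mid s) - \log p(z)
```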
4.2 Implementation
- Uses Soft Actor-Critic (SAC) to learn the skill-conditioned policy
- The entropy regularizer is scaled by alpha
- found empirically (0.01)
- trades off exploration against discriminability
- Uses a pseudo-reward r_z in place of the task reward (see the sketch below)
4.2 Implementation
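A minimal sketch of the pseudo-reward computation, assuming PyTorch, a small MLP discriminator q_φ(z|s), and a uniform categorical prior p(z); the class and function names here are illustrative, not from the authors' released code:

```python
import numpy as np
import torch
import torch.nn as nn

NUM_SKILLS = 20  # illustrative number of skills

class Discriminator(nn.Module):
    """q_phi(z | s): predicts which skill generated a given state."""
    def __init__(self, state_dim, num_skills=NUM_SKILLS, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_skills),
        )

    def forward(self, state):
        return self.net(state)  # logits over skills


def diayn_reward(discriminator, state, skill_id, num_skills=NUM_SKILLS):
    """Pseudo-reward r_z(s) = log q_phi(z|s) - log p(z), with p(z) uniform."""
    logits = discriminator(state)
    log_q_z = torch.log_softmax(logits, dim=-1)[..., skill_id]
    log_p_z = -np.log(num_skills)  # log of the uniform prior
    return log_q_z - log_p_z


# Usage: sample a skill z once per episode, feed (state, one-hot z) to the
# SAC policy, relabel transitions with r_z, and train the discriminator with
# a cross-entropy loss to predict z from s.
disc = Discriminator(state_dim=4)
s = torch.randn(1, 4)
print(diayn_reward(disc, s, skill_id=3))
```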
What Skills
are Learned
5. What Skills are Learned?
1. Does entropy regularization lead to more diverse skills?
- with a small alpha, the agent learns skills that move large distances in particular directions but fail to explore large parts of the state space
- increasing alpha, the skills visit a more diverse set of states, which may help with exploration in complex state spaces
- it is difficult to discriminate skills when alpha is increased further
[Figure: skills plotted by orientation and forward velocity]
5. What Skills are Learned?
2. How does the distribution of skills change during training?
- skills for the inverted pendulum and mountain car become increasingly diverse throughout training
- skills are learned with no reward, so it is natural that some skills correspond to small task reward while others correspond to large task reward
5. What Skills are Learned?
3. Does DIAYN explore effectively in complex environments?
- half-cheetah, hopper, and ant
- it learns diverse locomotion primitives
5. What Skills are Learned?
3. Does DIAYN explore effectively in complex environments?
- evaluate all skills on three reward functions:
  running (maximize X coordinate), jumping (maximize Z coordinate), moving (maximize L2 distance from origin)
- DIAYN learns some skills that achieve high reward
- DIAYN optimizes a collection of policies, which enables more diverse exploration
5. What Skills are Learned?
4. Does DIAYN ever learn skills that solve a benchmark task?
- for half-cheetah and hopper, DIAYN learns skills that run and hop forward quickly => good
Harnessing
Learned Skills
6. Harnessing Learned Skills
Three perhaps less obvious applications of the learned skills:
1. adapting skills to maximize a reward
2. hierarchical RL
3. imitation learning
6.1 Adapting Skills to Maximize Reward
- After DIAYN learns task-agnostic skills without supervision, we can quickly adapt the skills to solve a desired task
- Akin to computer vision researchers using models pre-trained on ImageNet
- DIAYN serves as (unsupervised) pre-training in resource-constrained settings
6.1 Adapting Skills to Maximize Reward
5. Can we use learned skills to directly maximize the task reward?
- the DIAYN-initialized approach differs from the train-from-scratch baseline only in how the weights are initialized, and it performs well => good (a minimal sketch of this adaptation follows)
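A minimal sketch of how the pretrained weights might be reused, assuming the skill policies from the unsupervised stage are PyTorch modules with the same architecture as the task policy (the helper name and the choice of which skill to copy are illustrative, not from the paper's code):

```python
import copy

import torch.nn as nn


def init_from_pretrained_skill(task_policy: nn.Module,
                               skill_policies,
                               skill_id: int = 0) -> nn.Module:
    """Initialize the task policy from a pretrained DIAYN skill.

    Assumes the architectures match (e.g., the skill input z is kept but held
    fixed). Training on the task reward then proceeds as usual; only the
    starting weights differ from the train-from-scratch baseline.
    """
    task_policy.load_state_dict(copy.deepcopy(skill_policies[skill_id].state_dict()))
    return task_policy
```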
6.2 Using Skills for Hierarchical RL
- In theory, hierarchical RL should decompose a complex task into motion primitives, which may be reused for multiple tasks
- In practice, algorithms for hierarchical RL encounter many difficulties:
  1. each motion primitive reduces to a single action [9]
  2. the hierarchical policy only samples a single motion primitive [24]
  3. all motion primitives attempt to do the entire task
- DIAYN discovers diverse, task-agnostic skills, which hold the promise of acting as building blocks for hierarchical RL (see the meta-controller sketch below)
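A minimal sketch of how frozen DIAYN skills can serve as the action space of a higher-level controller, assuming a gym-style `env`, pretrained `skill_policies`, and a `meta_policy` that picks a skill index; the method names and the fixed skill horizon are illustrative:

```python
def rollout_with_skills(env, meta_policy, skill_policies,
                        skill_horizon=100, max_steps=1000):
    """Hierarchical rollout: the meta-policy picks a skill index z every
    `skill_horizon` steps; the frozen skill policy then acts in the
    environment. Only the meta-policy is trained on the task reward."""
    obs, total_reward, steps = env.reset(), 0.0, 0
    while steps < max_steps:
        z = meta_policy.select_skill(obs)      # high-level action = skill index
        skill = skill_policies[z]              # frozen low-level policy
        for _ in range(skill_horizon):
            action = skill.act(obs, z)         # skills were trained conditioned on z
            obs, reward, done, _ = env.step(action)
            total_reward += reward
            steps += 1
            if done or steps >= max_steps:
                return total_reward
    return total_reward
```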
6.2 Using Skills for Hierarchical RL
6. Are skills discovered by DIAYN useful for hierarchical RL?
- DIAYN outperforms all baselines: TRPO and SAC are competitive on-policy and off-policy RL algorithms, while VIME includes an auxiliary objective to promote efficient exploration
6.2 Using Skills for Hierarchical RL
7. How can DIAYN leverage prior knowledge about what skills will be useful?
- in particular, we can condition the discriminator on only a subset of the observation, forcing DIAYN to find skills that are diverse in this subspace (but potentially indistinguishable along other dimensions)
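A minimal sketch of that idea, reusing the discriminator from the earlier snippet and assuming the prior knowledge is expressed as a set of observation indices (the index choice, e.g., the x-y position of a locomoting robot, is illustrative):

```python
import torch

# Indices of the observation dimensions we want skills to be diverse in
# (illustrative choice: the x-y position of the agent).
DIVERSITY_DIMS = [0, 1]

def diayn_reward_on_subspace(discriminator, state, skill_id,
                             dims=DIVERSITY_DIMS, num_skills=20):
    """Pseudo-reward where the discriminator only sees the chosen subspace,
    so skills must be distinguishable along those dimensions.
    The discriminator must be built with state_dim == len(dims)."""
    sub_state = state[..., dims]
    logits = discriminator(sub_state)
    log_q_z = torch.log_softmax(logits, dim=-1)[..., skill_id]
    log_p_z = -torch.log(torch.tensor(float(num_skills)))  # uniform prior
    return log_q_z - log_p_z
```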
6.3 Imitating an Expert
8. Can we use learned skills to imitate an expert?
- consider the setting where we are given an expert trajectory consisting of states (not actions)
6.3 Imitating an Expert
8. Can we use learned skills to imitate an expert?
- given the expert trajectory, we use our learned discriminator to estimate which skill was most likely to have generated the trajectory
- this optimization problem, which we can solve for categorical z by simple enumeration, is equivalent to an M-projection (a short sketch follows)
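A minimal sketch of that enumeration, reusing the discriminator above (names are illustrative): with a categorical z, the most likely skill is the one maximizing the summed log-probability the discriminator assigns to it across the expert states.

```python
import torch

def infer_skill_from_states(discriminator, expert_states):
    """Pick z* = argmax_z sum_t log q_phi(z | s_t) over the expert states.

    `expert_states` has shape (T, state_dim); the skill whose discriminator
    log-probability best explains every state wins.
    """
    with torch.no_grad():
        log_q = torch.log_softmax(discriminator(expert_states), dim=-1)  # (T, num_skills)
        return int(log_q.sum(dim=0).argmax())

# The inferred skill's policy can then be executed to imitate the expert.
```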
6.3 Imitating an Expert
9. How does DIAYN differ from Variational Intrinsic Control (VIC)?
- DIAYN uses maximum entropy policies and does not learn the prior p(z)
- we found that DIAYN consistently matched the expert trajectory more closely than VIC baselines lacking these elements
- when the distribution over skills p(z) is learned, the model may encounter a rich-get-richer problem
Conclusion
7. Conclusion
- This paper presents DIAYN, a method for learning skills without a reward function
- DIAYN learns diverse skills for complex tasks, often solving benchmark tasks with one of the learned skills without actually receiving any task reward
7. Conclusion
- Proposed methods for using the learned skills:
(1) to quickly adapt to a new task
(2) to solve complex tasks via hierarchical RL
(3) to imitate an expert
- As a rule of thumb, DIAYN may make learning a task easier by replacing the task’s complex action space with a set of useful skills
- DIAYN could be combined with methods for augmenting the observation space and reward function
7. Conclusion
- Using the common language of information theory, a joint objective can likely be derived
- DIAYN may also learn from human preferences more efficiently by having humans select among the learned skills
- Finally, for creativity and education, the skills produced by DIAYN might be used by game designers to allow players to control complex robots and by artists to design dancing robots
Thank you
