Diversity is All You Need :
Learning Skills without a Reward Function
김예찬(Paul Kim)
Index
1. Abstract
2. Introduction
3. Related Work
4. Diversity is All You Need
4.1 How it Works
4.2 Implementation
5. What Skills are Learned?
6. Harnessing Learned Skills
6.1 Adapting Skills to Maximize Reward
6.2 Using Skills for Hierarchical RL
6.3 Imitating an Expert
7. Conclusion
Abstract
1. Abstract
DIAYN(Diversity is All You Need)
- Agents can explore their environment and learn useful skills without supervision
- DIAYN can learn useful skills without a reward function
- It works by maximizing an information-theoretic objective using a maximum entropy policy
- DIAYN is presented as an effective pretraining method, addressing RL's problems with exploration and data efficiency
Introduction
2. Introduction
DRL has been demonstrated to effectively learn a wide range of reward-driven skills, including:
1. playing games
2. controlling robots
3. navigation
2. Introduction
DRL has been demonstrated to effectively learn a wide range of reward-driven skills, including:
1. playing games
2. controlling robots
3. navigation
DIAYN, in contrast, is not reward driven.
2. Introduction
DIAYN : Unsupervised skill discovery
- Learning useful skills without supervision can aid exploration in sparse-reward tasks
- For long-horizon tasks, skills discovered without reward can serve as primitives for HRL, effectively shortening the episode length
- Reward design is one form of human feedback; with DIAYN there is no need to spend much time designing a reward function
2. Introduction
What is a Skill?
- A skill is a policy that changes the state of the environment in a consistent way
- Skills might be useless
- Skills should be not only distinguishable but also as diverse as possible
- Diverse skills are robust to perturbations and explore the environment better
2. Introduction
Key idea
Acquire skills that are both distinguishable and diverse
- objective based on mutual information
- applications : HRL, imitation learning
2. Introduction
Five contributions
1. A method for learning useful skills without any rewards
- maximizing an information-theoretic objective with a maximum entropy policy
2. This simple exploration objective results in the unsupervised emergence of diverse skills
- e.g., running and jumping; some of the learned skills solve the benchmark task
3. A simple method for using learned skills for HRL, and this method solves challenging tasks
4. How discovered skills can be quickly adapted to solve a new task
5. Discovered skills can be used for imitation learning
2. Introduction
Related Work
3. Related Work
HRL Perspective
Previous work
- HRL has learned skills to maximize a single, known reward function by jointly learning a set of skills and a meta-controller
- in joint training, the meta-policy does not select ‘bad’ options, so these options never receive any reward signal to improve
DIAYN
- uses a random meta-policy
- learns skills with no reward
3. Related Work
Connection between RL and information theory
Previous work
- mutual information between states and actions as a notion of empowerment for an intrinsically motivated agent
- a discriminability objective is equivalent to maximizing the mutual information between the latent skill z and some aspect of the corresponding trajectory
- settings with many tasks and a reward function
- settings with a single task reward
DIAYN
- maximizes the mutual information between states and skills (this can be interpreted as maximizing the empowerment of a hierarchical agent whose action space is the set of skills)
3. Related Work
Connection between RL and information theory
DIAYN (continued)
- uses maximum entropy policies to force skills to be diverse
- fixes the distribution p(z) rather than learning it, preventing p(z) from collapsing to sampling only a handful of skills
- the discriminator looks at every state, which provides additional reward signal
3. Related Work
Neuroevolution and evolutionary algorithms
- work on neuroevolution and evolutionary algorithms has studied how complex behaviors can be learned by directly maximizing diversity
DIAYN
- acquires complex skills with minimal supervision to improve efficiency
- focuses on deriving a general, information-theoretic objective that does not require manual design of distance metrics and can be applied to any RL task without additional engineering
3. Related Work
Intrinsic motivation
- previous works use an intrinsic motivation objective to learn a single policy
DIAYN
- proposes an objective for learning many, diverse policies
Diversity is
All You Need
4. Diversity is All You Need
Unsupervised RL paradigm
- the agent is allowed an unsupervised “exploration” stage followed by a supervised stage
- the aim of the unsupervised stage is to learn skills that eventually make it easier to maximize the task reward in the supervised stage
- conveniently, because skills are learned without a priori knowledge of the task, the learned skills can be used for many different tasks
Unsupervised stage
- the agent explores the environment but does not receive any task reward; its goal is to learn skills
Supervised stage
- the agent receives the task reward, and its goal is to learn the task by maximizing that reward
4.1 How it Works?
DIAYN rests on three ideas (the combined objective is written out below):
1. The skill dictates the states that the agent visits
- the skill should control which states the agent visits, so we maximize the mutual information between skills and states, MI(s, z)
2. To distinguish skills, we use states, not actions
- to ensure that states, not actions, are used to distinguish skills, we minimize the mutual information between skills and actions given the state, MI(a, z | s)
3. The skills should be as diverse as possible
- maximize the entropy of the mixture of policies (the collection of skills together with p(z))
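Putting the three ideas together, the DIAYN objective and its variational lower bound can be written roughly as follows (S, Z, A are state, skill, and action random variables; q_φ is the learned skill discriminator; p(z) is the fixed skill prior; r_z is the pseudo-reward used in Section 4.2):

```latex
% DIAYN objective: maximize MI(S;Z) and policy entropy, minimize MI(A;Z|S).
\begin{aligned}
\mathcal{F}(\theta) &\triangleq I(S;Z) + \mathcal{H}[A \mid S] - I(A;Z \mid S) \\
                    &= \mathcal{H}[Z] - \mathcal{H}[Z \mid S] + \mathcal{H}[A \mid S, Z] \\
% Replacing the intractable posterior p(z|s) with the discriminator q_phi(z|s)
% gives a variational lower bound G(theta, phi):
\mathcal{F}(\theta) &\geq \mathcal{H}[A \mid S, Z]
  + \mathbb{E}_{z \sim p(z),\; s \sim \pi(z)}\!\left[\log q_{\phi}(z \mid s) - \log p(z)\right]
  \triangleq \mathcal{G}(\theta, \phi)
\end{aligned}
% The per-step pseudo-reward optimized by the policy:
r_z(s, a) \triangleq \log q_{\phi}(z \mid s) - \log p(z)
```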
4.2 Implementation
- Uses Soft Actor-Critic (SAC) to learn the skill-conditioned policy
- The entropy regularizer is scaled by alpha
- found empirically (0.01)
- trades off exploration against discriminability
- Uses a pseudo-reward r_z in place of the task reward (see the sketch below)
4.2 Implementation
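A minimal sketch of the pseudo-reward computation, assuming PyTorch, a small MLP discriminator q_φ(z|s), and a uniform categorical prior p(z); the class and function names here are illustrative, not from the authors' released code:

```python
import numpy as np
import torch
import torch.nn as nn

NUM_SKILLS = 20  # illustrative number of skills

class Discriminator(nn.Module):
    """q_phi(z | s): predicts which skill generated a given state."""
    def __init__(self, state_dim, num_skills=NUM_SKILLS, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_skills),
        )

    def forward(self, state):
        return self.net(state)  # logits over skills


def diayn_reward(discriminator, state, skill_id, num_skills=NUM_SKILLS):
    """Pseudo-reward r_z(s) = log q_phi(z|s) - log p(z), with p(z) uniform."""
    logits = discriminator(state)
    log_q_z = torch.log_softmax(logits, dim=-1)[..., skill_id]
    log_p_z = -np.log(num_skills)  # log of the uniform prior
    return log_q_z - log_p_z


# Usage: sample a skill z once per episode, feed (state, one-hot z) to the
# SAC policy, relabel transitions with r_z, and train the discriminator with
# a cross-entropy loss to predict z from s.
disc = Discriminator(state_dim=4)
s = torch.randn(1, 4)
print(diayn_reward(disc, s, skill_id=3))
```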
What Skills
are Learned
5. What Skills are Learned?
1. Does entropy regularization lead to more diverse skills?
- with a small alpha, the agent learns skills that move large distances in particular directions but fail to explore large parts of the state space
- increasing alpha, the skills visit a more diverse set of states, which may help with exploration in complex state spaces
- it is difficult to discriminate skills when alpha is increased further
[Figure: skills plotted by orientation and forward velocity]
5. What Skills are Learned?
2. How does the distribution of skills change during training?
- skills for the inverted pendulum and mountain car become increasingly diverse throughout training
- skills are learned with no reward, so it is natural that some skills correspond to small task reward while others correspond to large task reward
5. What Skills are Learned?
3. Does DIAYN explore effectively in complex environments?
- half-cheetah, hopper, and ant
- it learns diverse locomotion primitives
5. What Skills are Learned?
3. Does DIAYN explore effectively in complex environments?
- evaluate all skills on three reward functions:
  running (maximize X coordinate), jumping (maximize Z coordinate), moving (maximize L2 distance from origin)
- DIAYN learns some skills that achieve high reward
- DIAYN optimizes a collection of policies, which enables more diverse exploration
5. What Skills are Learned?
4. Does DIAYN ever learn skills that solve a benchmark task?
- for half-cheetah and hopper, DIAYN learns skills that run and hop forward quickly => good
Harnessing
Learned Skills
6. Harnessing Learned Skills
Three perhaps less obvious applications of the learned skills:
1. adapting skills to maximize a reward
2. hierarchical RL
3. imitation learning
6.1 Adapting Skills to Maximize Reward
- After DIAYN learns task-agnostic skills without supervision, we can quickly adapt the skills to solve a desired task
- Akin to computer vision researchers using models pre-trained on ImageNet
- DIAYN serves as (unsupervised) pre-training in resource-constrained settings
6.1 Adapting Skills to Maximize Reward
5. Can we use learned skills to directly maximize the task reward?
- the DIAYN-initialized approach differs from the train-from-scratch baseline only in how the weights are initialized, and it performs well => good (a minimal sketch of this adaptation follows)
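A minimal sketch of how the pretrained weights might be reused, assuming the skill policies from the unsupervised stage are PyTorch modules with the same architecture as the task policy (the helper name and the choice of which skill to copy are illustrative, not from the paper's code):

```python
import copy

import torch.nn as nn


def init_from_pretrained_skill(task_policy: nn.Module,
                               skill_policies,
                               skill_id: int = 0) -> nn.Module:
    """Initialize the task policy from a pretrained DIAYN skill.

    Assumes the architectures match (e.g., the skill input z is kept but held
    fixed). Training on the task reward then proceeds as usual; only the
    starting weights differ from the train-from-scratch baseline.
    """
    task_policy.load_state_dict(copy.deepcopy(skill_policies[skill_id].state_dict()))
    return task_policy
```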
6.2 Using Skills for Hierarchical RL
- In theory, hierarchical RL should decompose a complex task into motion primitives, which may be reused for multiple tasks
- In practice, algorithms for hierarchical RL encounter many difficulties:
  1. each motion primitive reduces to a single action [9]
  2. the hierarchical policy only samples a single motion primitive [24]
  3. all motion primitives attempt to do the entire task
- DIAYN discovers diverse, task-agnostic skills, which hold the promise of acting as building blocks for hierarchical RL (see the meta-controller sketch below)
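A minimal sketch of how frozen DIAYN skills can serve as the action space of a higher-level controller, assuming a gym-style `env`, pretrained `skill_policies`, and a `meta_policy` that picks a skill index; the method names and the fixed skill horizon are illustrative:

```python
def rollout_with_skills(env, meta_policy, skill_policies,
                        skill_horizon=100, max_steps=1000):
    """Hierarchical rollout: the meta-policy picks a skill index z every
    `skill_horizon` steps; the frozen skill policy then acts in the
    environment. Only the meta-policy is trained on the task reward."""
    obs, total_reward, steps = env.reset(), 0.0, 0
    while steps < max_steps:
        z = meta_policy.select_skill(obs)      # high-level action = skill index
        skill = skill_policies[z]              # frozen low-level policy
        for _ in range(skill_horizon):
            action = skill.act(obs, z)         # skills were trained conditioned on z
            obs, reward, done, _ = env.step(action)
            total_reward += reward
            steps += 1
            if done or steps >= max_steps:
                return total_reward
    return total_reward
```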
6.2 Using Skills for Hierarchical RL
6. Are skills discovered by DIAYN useful for hierarchical RL?
- DIAYN outperforms all baselines: TRPO and SAC are competitive on-policy and off-policy RL algorithms, while VIME includes an auxiliary objective to promote efficient exploration
6.2 Using Skills for Hierarchical RL
7. How can DIAYN leverage prior knowledge about what skills will be useful?
- in particular, we can condition the discriminator on only a subset of the observation, forcing DIAYN to find skills that are diverse in this subspace (but potentially indistinguishable along other dimensions)
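A minimal sketch of that idea, reusing the discriminator from the earlier snippet and assuming the prior knowledge is expressed as a set of observation indices (the index choice, e.g., the x-y position of a locomoting robot, is illustrative):

```python
import torch

# Indices of the observation dimensions we want skills to be diverse in
# (illustrative choice: the x-y position of the agent).
DIVERSITY_DIMS = [0, 1]

def diayn_reward_on_subspace(discriminator, state, skill_id,
                             dims=DIVERSITY_DIMS, num_skills=20):
    """Pseudo-reward where the discriminator only sees the chosen subspace,
    so skills must be distinguishable along those dimensions.
    The discriminator must be built with state_dim == len(dims)."""
    sub_state = state[..., dims]
    logits = discriminator(sub_state)
    log_q_z = torch.log_softmax(logits, dim=-1)[..., skill_id]
    log_p_z = -torch.log(torch.tensor(float(num_skills)))  # uniform prior
    return log_q_z - log_p_z
```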
6.3 Imitating an Expert
8. Can we use learned skills to imitate an expert?
- consider the setting where we are given an expert trajectory consisting of states (not actions)
6.3 Imitating an Expert
8. Can we use learned skills to imitate an expert?
- given the expert trajectory, we use our learned discriminator to estimate which skill was most likely to have generated the trajectory
- this optimization problem, which we can solve for categorical z by simple enumeration, is equivalent to an M-projection (a short sketch follows)
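A minimal sketch of that enumeration, reusing the discriminator above (names are illustrative): with a categorical z, the most likely skill is the one maximizing the summed log-probability the discriminator assigns to it across the expert states.

```python
import torch

def infer_skill_from_states(discriminator, expert_states):
    """Pick z* = argmax_z sum_t log q_phi(z | s_t) over the expert states.

    `expert_states` has shape (T, state_dim); the skill whose discriminator
    log-probability best explains every state wins.
    """
    with torch.no_grad():
        log_q = torch.log_softmax(discriminator(expert_states), dim=-1)  # (T, num_skills)
        return int(log_q.sum(dim=0).argmax())

# The inferred skill's policy can then be executed to imitate the expert.
```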
6.3 Imitating an Expert
9. How does DIAYN differ from Variational Intrinsic Control (VIC)?
- DIAYN uses maximum entropy policies and does not learn the prior p(z)
- we found that DIAYN consistently matched the expert trajectory more closely than VIC baselines lacking these elements
- when the distribution over skills p(z) is learned, the model may encounter a rich-get-richer problem
Conclusion
7. Conclusion
- This paper presents DIAYN, a method for learning skills without a reward function
- DIAYN learns diverse skills for complex tasks, often solving benchmark tasks with one of the learned skills without actually receiving any task reward
7. Conclusion
- Proposed methods for using the learned skills:
(1) to quickly adapt to a new task
(2) to solve complex tasks via hierarchical RL
(3) to imitate an expert
- As a rule of thumb, DIAYN may make learning a task easier by replacing the task’s complex action space with a set of useful skills
- DIAYN could be combined with methods for augmenting the observation space and reward function
7. Conclusion
- Using the common language of information theory, a joint objective can likely be derived
- DIAYN may also learn from human preferences more efficiently by having humans select among the learned skills
- Finally, for creativity and education, the skills produced by DIAYN might be used by game designers to allow players to control complex robots and by artists to design dancing robots
Thank you
