Detailed version: https://www.slideshare.net/Seungjoon1/230915-paper-summary-learning-to-world-model-with-language-with-details-publicpdf
This is a personal summary of the paper "Learning to Model the World with Language", https://arxiv.org/abs/2308.01399
Some of the contents may be incorrect.
Please send me an email if you want to contact me: sjlee1218@postech.ac.kr (for corrections, additional materials, ideas to develop this work, or anything else).
3. Caution!!!
• This is material summarizing a paper for my personal research meeting.
• Some of the contents may be incorrect!
• Some contributions and experiments are intentionally excluded because they are not directly related to my research interests.
• Methods are simplified for easy explanation.
• Please send me an email if you want to contact me: sjlee1218@postech.ac.kr (for corrections, additional materials, ideas to develop this work, or anything else).
4. Situations
• Most language-conditioned RL methods use language only as instructions (e.g., “Pick the blue box”).
• However, language does not always match the optimal action.
• Therefore, mapping language only to actions is a weak learning signal.
“Put the bowls away”
5. Complication
• On the other hand, humans can predict the future using language.
• Humans can predict environment dynamics (e.g., “wrenches tighten nuts”).
• Humans can predict future observations (e.g., “the paper is outside”).
6. Questions & Hypothesis
• Question:
• If we let a reinforcement learning agent predict the future using language, will its performance improve?
• Hypothesis:
• Predicting future representations provides a rich learning signal for agents about how language relates to the world.
• A rich learning signal here means a frequent, stable training signal.
7. Contributions
• DynaLang enables RL agents to use diverse types of language, for example hints or dynamics descriptions, along with instructions.
• DynaLang proposes a self-supervised future-prediction objective to improve training performance.
8. Why is This New?
• Previous language-based RL methods used language either only as instructions or only as descriptions of the environment.
• DynaLang unifies these settings so that agents learn from diverse types of language.
• Previous works mostly condition policies directly on language to generate actions.
• DynaLang proposes a future-prediction objective to train a world model that associates language, images, and dynamics.
10. Problem Setting
• Observation: $o_t = (x_t, l_t)$, where $x_t$ is an image and $l_t$ is a language token.
• An agent chooses an action $a_t$, then the environment returns:
• a reward $r_{t+1}$,
• a flag $c_{t+1}$ indicating whether the episode continues,
• and the next observation $o_{t+1}$.
• The agent's goal is to maximize the expected discounted return $\mathbb{E}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right]$ (see the interaction sketch below).
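To make the setting above concrete, here is a minimal Python sketch of the interaction loop; the env and agent objects and their methods are hypothetical placeholders, not the paper's API.

# Minimal sketch of the interaction loop. `env` and `agent` are
# hypothetical stand-ins, not DynaLang's actual interface.
gamma = 0.99                 # discount factor
obs = env.reset()            # obs = (image x_t, language token l_t)
ret, discount = 0.0, 1.0
done = False
while not done:
    action = agent.act(obs)                   # choose a_t
    obs, reward, cont = env.step(action)      # o_{t+1}, r_{t+1}, c_{t+1}
    ret += discount * reward                  # accumulate gamma^{t-1} * r_t
    discount *= gamma
    done = not cont                           # episode ends when c_{t+1} is false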
11. Method Outline
• DynaLang components
• World model: encodes the current image observation and language into a representation.
• RL agent: uses the encoded representation to act, maximizing the sum of discounted rewards.
12. Method - World Model
Outline
• World model components:
• Encoder-decoder: learns to represent the current state.
• Sequence model: learns to predict the future state representation.
13. Method - World Model
Base model (previous work)
• DynaLang = Dreamer V3 + language + future prediction objective.
• Dreamer V3 learns to compute compact representations of the current state, and learns how these representations change with actions.
Architecture of Dreamer V3
14. Method - World Model
Incorporation of language
• DynaLang incorporates language into the encoder-decoder of Dreamer V3.
• By doing this, DynaLang obtains representations that unify visual observations and language (see the encoder sketch below).
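As a rough illustration of this fusion, here is a minimal PyTorch sketch of one way to encode an image and a language token into a single latent; the module choices and sizes are my own assumptions, not DynaLang's exact architecture.

import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    # Illustrative: fuse an image embedding and a token embedding into one latent z_t.
    def __init__(self, vocab_size=1000, token_dim=64, latent_dim=256):
        super().__init__()
        self.image_net = nn.Sequential(            # small CNN for the image x_t
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.token_emb = nn.Embedding(vocab_size, token_dim)  # one token l_t per step
        self.fuse = nn.LazyLinear(latent_dim)      # concat -> joint latent

    def forward(self, image, token):
        img_feat = self.image_net(image)           # (B, F)
        tok_feat = self.token_emb(token)           # (B, token_dim)
        return self.fuse(torch.cat([img_feat, tok_feat], dim=-1))

Note that, per the problem setting above, the agent receives one language token per time step, so the encoder embeds a single token rather than a whole sentence.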
15. Method - World Model
Prediction of the future
• DynaLang adds future-representation prediction to the sequence model of Dreamer V3.
• Future-representation prediction lets the agent extract information from language that relates to the dynamics of multiple modalities (see the sketch below).
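A minimal sketch of a sequence model with such a future-prediction head, assuming a GRU-based recurrence; the actual model is Dreamer V3's sequence model, so the names and sizes here are illustrative only.

import torch
import torch.nn as nn

class SequenceModel(nn.Module):
    # Illustrative: update the history h_t and predict the next latent z_hat.
    def __init__(self, latent_dim=256, action_dim=8, hidden_dim=512):
        super().__init__()
        self.rnn = nn.GRUCell(latent_dim + action_dim, hidden_dim)
        self.predict_next = nn.Linear(hidden_dim, latent_dim)  # future-prediction head

    def forward(self, h, z, a):
        h_next = self.rnn(torch.cat([z, a], dim=-1), h)  # h_{t+1} from (h_t, z_t, a_t)
        z_hat = self.predict_next(h_next)                # predicted next representation
        return h_next, z_hat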
16. Method - World Model
Model Losses
• World model loss: $\mathcal{L} = \mathcal{L}_x + \mathcal{L}_l + \mathcal{L}_r + \mathcal{L}_c + \mathcal{L}_{\mathrm{reg}} + \mathcal{L}_{\mathrm{pred}}$, where
• Image loss: $\mathcal{L}_x = \lVert \hat{x}_t - x_t \rVert_2^2$
• Language loss: $\mathcal{L}_l = \mathrm{categorical\_cross\_entropy}(\hat{l}_t, l_t)$
• Reward loss: $\mathcal{L}_r = (\hat{r}_t - r_t)^2$
• Continue loss: $\mathcal{L}_c = \mathrm{binary\_cross\_entropy}(\hat{c}_t, c_t)$
• Regularizer: $\mathcal{L}_{\mathrm{reg}} = \beta_{\mathrm{reg}} \max(1, \mathrm{KL}[z_t \,\Vert\, \mathrm{sg}(\hat{z}_t)])$, where sg is the stop-gradient operator
• Future prediction loss: $\mathcal{L}_{\mathrm{pred}} = \beta_{\mathrm{pred}} \max(1, \mathrm{KL}[\mathrm{sg}(z_t) \,\Vert\, \hat{z}_t])$ (see the combined-loss sketch below)
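Putting the bullets above together, here is a minimal PyTorch sketch of the combined loss; it assumes z and z_hat are categorical logits and uses illustrative beta coefficients.

import torch
import torch.nn.functional as F
from torch.distributions import Categorical, kl_divergence

def world_model_loss(x_hat, x, l_logits, l, r_hat, r, c_logits, c,
                     z_logits, z_hat_logits, beta_reg=0.1, beta_pred=1.0):
    loss_x = F.mse_loss(x_hat, x)                              # image reconstruction
    loss_l = F.cross_entropy(l_logits, l)                      # language token reconstruction
    loss_r = F.mse_loss(r_hat, r)                              # reward prediction
    loss_c = F.binary_cross_entropy_with_logits(c_logits, c)   # continue flag
    kl = lambda p, q: kl_divergence(Categorical(logits=p),
                                    Categorical(logits=q)).mean()
    # .detach() plays the role of sg (stop-gradient); clamping the KL below
    # 1 nat implements max(1, KL), which zeroes the gradient for small KLs.
    loss_reg = beta_reg * torch.clamp(kl(z_logits, z_hat_logits.detach()), min=1.0)
    loss_pred = beta_pred * torch.clamp(kl(z_logits.detach(), z_hat_logits), min=1.0)
    return loss_x + loss_l + loss_r + loss_c + loss_reg + loss_pred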
17. Method - RL Agent
Outline
• The RL agent is a simple actor-critic agent.
• Actor: $\pi(a_t \mid z_t, h_t)$
• Critic: $V(z_t, h_t)$
• Note that the RL agent is not conditioned on language directly (see the sketch below).
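A minimal actor-critic sketch over $(z_t, h_t)$, assuming discrete actions; the layer sizes are illustrative.

import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    def __init__(self, latent_dim=256, hidden_dim=512, action_dim=8):
        super().__init__()
        in_dim = latent_dim + hidden_dim                 # concatenated (z_t, h_t)
        self.actor = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                   nn.Linear(256, action_dim))  # logits of pi(a_t | z_t, h_t)
        self.critic = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                    nn.Linear(256, 1))          # V(z_t, h_t)

    def forward(self, z, h):
        feat = torch.cat([z, h], dim=-1)
        return Categorical(logits=self.actor(feat)), self.critic(feat).squeeze(-1)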
18. Method - RL Agent
Environment interaction
• The RL agent interacts with the environment using the encoded representation $z_t$ and the history $h_t$.
19. Method - RL Agent
Training
• Let $R_t = r_t + \gamma c_t \left( (1 - \lambda) V(z_{t+1}, h_{t+1}) + \lambda R_{t+1} \right)$, the estimated discounted sum of future rewards (a $\lambda$-return).
• Critic loss: $\mathcal{L}_\phi = (V_\phi(z_t, h_t) - R_t)^2$
• Actor loss: $\mathcal{L}_\theta = -(R_t - V(z_t, h_t)) \log \pi_\theta(a_t \mid h_t, z_t)$, maximizing the return estimate.
• The agent is trained only on imagined rollouts generated by the world model.
• That is, the agent is trained on its own actions and on the states and rewards predicted by the world model (see the $\lambda$-return sketch below).
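A sketch of the $\lambda$-return recursion and the two losses above, computed backward over an imagined rollout of length T; shapes and names are illustrative.

import torch

def lambda_returns(rewards, conts, values, gamma=0.99, lam=0.95):
    # rewards, conts: (T, B); values: (T + 1, B), last entry bootstraps R_T.
    R = values[-1]
    returns = torch.zeros_like(rewards)
    for t in reversed(range(rewards.shape[0])):
        # R_t = r_t + gamma * c_t * ((1 - lam) * V_{t+1} + lam * R_{t+1})
        R = rewards[t] + gamma * conts[t] * ((1 - lam) * values[t + 1] + lam * R)
        returns[t] = R
    return returns

# Given V = critic(z, h) and log_probs = pi.log_prob(actions) on the rollout:
# critic_loss = ((V - returns.detach()) ** 2).mean()
# actor_loss  = -((returns - V).detach() * log_probs).mean()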
21. Experiments 1 - Diverse Types of Language
Questions
• Questions to address:
• Can DynaLang use diverse types of language along with instructions?
• If so, does it improve task performance?
22. Experiments 1 - Diverse Types of Language
Setup
• Env: HomeGrid
• a multitask grid world where agents receive task instructions in language, as well as language hints.
• Agents get a reward of 1 when a task is completed, and then a new task is sampled.
• Therefore, agents must complete as many tasks as possible before the episode terminates at 100 steps.
HomeGrid env. Agents receive 3 types of hints.
23. Experiments 1 - Diverse Types of Language
Results
• Baselines: model-free off-policy algorithms, IMPALA and R2D2.
• Image embeddings and language embeddings are simply fed to the policy as conditioning.
• DynaLang solves more tasks with hints, while simple language-conditioned RL gets worse with hints.
HomeGrid training performance after 50M steps (2 seeds)
24. Experiments 2 - Future Prediction
Questions
• Questions to address:
• Is adding future prediction more effective than using language only to generate actions?
25. Experiments 2 - Future Prediction
Setup
• Env: Messenger
• a grid world where agents must deliver a message while avoiding enemies, using text manuals.
• Agents must understand the manuals and relate them to the environment to achieve a high score.
Messenger env. Agents get text manuals.
26. Experiments 2 - Future Prediction
Results
• EMMA is added as a comparison:
• a language-conditioned, gridworld-specific method that uses language only to generate actions.
• Only DynaLang can learn S3, the most difficult setting.
• Adding future prediction helps training more than using language only for action generation.
• However, the authors do not include ablation studies that exclude the future-prediction loss from their architecture.
Messenger training performance (2 seeds). S1 is the easiest; S3 is the hardest.