[DL Reading Group] Decision Transformer: Reinforcement Learning via Sequence Modeling

DEEP LEARNING JP
[DL Papers] Decision Transformer:
Reinforcement Learning via Sequence Modeling
XIN ZHANG, Matsuo Lab
http://deeplearning.jp/

Bibliographic information
● Title:
○ Decision Transformer: Reinforcement Learning via Sequence Modeling
● Authors
○ Lili Chen*, Kevin Lu*, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel,
Aravind Srinivas*, Igor Mordatch
● Institutions: UC Berkeley, Facebook AI Research, Google Brain
● 12 Jun 2021
● Summary
○ Proposes using a Transformer to treat RL as a sequence modeling problem
○ Performance on par with state-of-the-art model-free offline RL baselines

Transformer
● Can the powerful Transformer be applied to RL?
● Self-attention looks well suited to RL over long sequences
Offline RL
● The main challenges are error accumulation and overestimation by the value function
● A natural setting in which to apply a Transformer
From CS 285

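Decision Transformer (DT)
● GPT architecture
○ Predicts the next action
○ Discrete actions: cross-entropy loss
○ Continuous actions: mean-squared error loss
● Returns-to-go:
○ An action at a given timestep affects only the rewards that come after it
○ Required as an input for predicting the action
● Feed K timesteps (3K tokens); 1 timestep = (return-to-go, state, action)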
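The returns-to-go and the 3K-token layout described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `returns_to_go` and `interleave_tokens` and the tuple-based token format are assumptions made for this example, and a real model would embed each token instead of tagging it with a string.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Undiscounted sum of rewards from each timestep to the end of the episode."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry is the reward at t plus everything after it.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave_tokens(
    rtg: Sequence[float],
    states: Sequence[Any],
    actions: Sequence[Any],
    K: int,
) -> List[Tuple[str, Any]]:
    """Keep the last K timesteps and lay them out as (R, s, a, R, s, a, ...)."""
    tokens: List[Tuple[str, Any]] = []
    for R, s, a in list(zip(rtg, states, actions))[-K:]:
        tokens.extend([("rtg", R), ("state", s), ("action", a)])
    return tokens


rewards = [0.0, 1.0, 0.0, 2.0]
rtg = returns_to_go(rewards)
tokens = interleave_tokens(rtg, ["s0", "s1", "s2", "s3"], ["a0", "a1", "a2", "a3"], K=3)
print(rtg)          # [3.0, 3.0, 2.0, 2.0]
print(len(tokens))  # 9, i.e. 3 tokens for each of the K = 3 timesteps
```

Conditioning on the return-to-go rather than the immediate reward is what lets the target be set at test time: the first token of an episode is simply the total return one wants to obtain.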

Illustrative example
❏ If the training data contain a similar state and desired return, DT outputs the corresponding action

4. Evaluations on Offline RL Benchmarks

4.1 Atari (Breakout, Qbert, Pong, Seaquest)
❏ Competitive with CQL, but weak on Qbert.
❏ K=30 (except K=50 for Pong)

4.2 OpenAI Gym (HalfCheetah, Hopper, Walker, Reacher)
❏ DT wins on most of the OpenAI Gym tasks
❏ K=20 (except K=5 for Reacher)

5.1 Does DT perform BC on a subset of the data?
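❏ Percentile BC: behavior cloning trained only on the best data (unrealistic in practice, because which data are best is not known in advance)
❏ The aim is to show how DT differs from BC.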
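The data selection behind Percentile BC can be sketched as below. This is an illustration under assumptions: the helper name `top_percent_by_return` and the dictionary layout of a trajectory are hypothetical, and the sketch filters whole trajectories by their total return, which may differ in detail from the paper's exact procedure.

```python
import math
from typing import Dict, List


def top_percent_by_return(trajectories: List[Dict], percent: float) -> List[Dict]:
    """Return the top `percent` % of trajectories, ranked by total reward."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ranked = sorted(trajectories, key=lambda tr: sum(tr["rewards"]), reverse=True)
    # Always keep at least one trajectory so that the subset is never empty.
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:keep]


# Hypothetical dataset: five trajectories with total returns 1, 5, 3, 4, 2.
dataset = [
    {"id": "a", "rewards": [1.0]},
    {"id": "b", "rewards": [2.0, 3.0]},
    {"id": "c", "rewards": [3.0]},
    {"id": "d", "rewards": [4.0]},
    {"id": "e", "rewards": [1.0, 1.0]},
]
subset = top_percent_by_return(dataset, percent=40)
print([tr["id"] for tr in subset])  # ['b', 'd']
```

Ordinary behavior cloning would then be trained on `subset` only. The selection needs the returns of the whole dataset up front, which is why the slide calls this baseline unrealistic.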

5.2 How well does DT model the distribution of returns?
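❏ The target return specifies which behavior to produce, so DT is not limited to the "optimal" behavior.
❏ On the other hand, an appropriate target return has to be supplied, which is a problem when it is not known.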
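How a target return steers behavior at evaluation time can be sketched with a toy rollout. Everything here is an assumption for illustration: `env_step` is a made-up five-step environment in which action 1 yields reward 1, and `toy_policy` is a hand-written stand-in for a trained DT. The only part taken from the method is the bookkeeping: after each step, the achieved reward is subtracted from the return-to-go that is fed back to the model.

```python
from typing import Callable, List, Tuple


def env_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Toy environment (assumption): 5 steps long, reward equals the action (0 or 1)."""
    next_state = state + 1
    return next_state, float(action), next_state >= 5


def toy_policy(rtg: List[float], states: List[int], actions: List[int]) -> int:
    """Stand-in for a trained DT: collect reward only while some return is still owed."""
    return 1 if rtg[-1] > 0 else 0


def rollout(
    policy: Callable[[List[float], List[int], List[int]], int],
    target_return: float,
    max_steps: int = 100,
) -> Tuple[float, List[float]]:
    rtg, states, actions = [target_return], [0], []
    total = 0.0
    for _ in range(max_steps):
        action = policy(rtg, states, actions)
        next_state, reward, done = env_step(states[-1], action)
        actions.append(action)
        total += reward
        if done:
            break
        # Decrement the return-to-go by the reward that was just obtained.
        rtg.append(rtg[-1] - reward)
        states.append(next_state)
    return total, rtg


achieved, rtg_trace = rollout(toy_policy, target_return=3.0)
print(achieved)   # 3.0
print(rtg_trace)  # [3.0, 2.0, 1.0, 0.0, 0.0]
```

Asking for a return of 3.0 yields 3.0 even though 5.0 is achievable in this toy setting, while asking for more than 5.0 cannot be satisfied, which mirrors both bullets above.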

5.3 What is the benefit of using a longer context length?
❏ With K = 1 (a single timestep of context, as in standard RL), DT performs poorly.
❏ The choice of K matters and varies by task, so it becomes a hyperparameter.

5.4 Does DT perform effective long-term credit assignment?
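❏ In the Key-to-Door setting, DT picks up on the events that matter.
❏ With more data, BC can solve it as well.
Key-to-Door example (the paper has no figure!)
- Pick up the key in the key room (left)
- Empty room (middle)
- Open the door (blue) in the door room (right)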

5.5 Can DT be an accurate critic in sparse reward settings?
❏ DT's attention works well.
❏ (Though the experiment does feel designed to play to DT's strengths.)

5.6 Does DT perform well in sparse reward settings?
❏ Delayed reward: a setting in which the entire reward is received at the final timestep
❏ Decision Transformer suffers the least damage
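The delayed-reward setting can be sketched as a transformation of the reward sequence. The helper names `delay_rewards` and `suffix_sums` are assumptions for this example. It also shows one plausible reading of why DT is barely affected: for every timestep except the last, the return-to-go input equals the episode return in both settings, so the delayed version only removes the intermediate decrements.

```python
from itertools import accumulate
from typing import List, Sequence


def delay_rewards(rewards: Sequence[float]) -> List[float]:
    """Move the whole episode return to the final timestep; earlier rewards become 0."""
    if not rewards:
        return []
    return [0.0] * (len(rewards) - 1) + [float(sum(rewards))]


def suffix_sums(rewards: Sequence[float]) -> List[float]:
    """Returns-to-go: sum of rewards from each timestep to the end of the episode."""
    return list(accumulate(reversed(rewards)))[::-1]


dense = [1.0, 0.0, 2.0, 0.0]
delayed = delay_rewards(dense)

print(delayed)               # [0.0, 0.0, 0.0, 3.0]
print(suffix_sums(dense))    # [3.0, 2.0, 2.0, 0.0]
print(suffix_sums(delayed))  # [3.0, 3.0, 3.0, 3.0]
```

The total return is unchanged, so a return-conditioned model still sees the same target at the first timestep, whereas a method that bootstraps from per-step rewards loses its intermediate learning signal.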

6.1 Offline and supervised reinforcement learning
I. Distribution shift in offline RL
A. Constrain the policy action space
B. Incorporate value pessimism
C. Incorporate pessimism into learned dynamics models
II. Learning wide behavior distributions
A. Learn a task-agnostic set of skills with likelihood-based approaches
B. Learn them by maximizing mutual information
III. Return conditioning / "supervised RL"
A. Similar to DT, but DT benefits from the use of long contexts for behavior modeling and long-term credit assignment
❏ There is a great deal of work addressing the distribution shift problem in offline RL!
❏ Work that treats reinforcement learning as supervised learning

6.2 Credit assignment
❏ Research on how to assign reward to the steps that contributed most
❏ The experiments suggest that Transformers are well suited to this
1. Self-Attentional Credit Assignment for Transfer in Reinforcement Learning
2. Hindsight Credit Assignment
3. Counterfactual Credit Assignment in Model-Free Reinforcement Learning

6.3 Conditional language generation
6.4 Attention and transformer models
❏ There is a large body of related work on conditional language generation, Transformers, and attention

Offline RL, sequence modeling, goal conditioning by reward.
❏ The idea is interesting, and I expect a lot of follow-up work
❏ Not knowing the appropriate target return is a problem, so I would like to think of ideas that could solve it
Future work
- Stochastic Decision Transformer
  - Conditioning on return distributions to model stochastic settings instead of deterministic returns
- Model-based Decision Transformer
  - Transformer models can also be used to model the state evolution of a trajectory
- Real-world applications
- Augmenting RL
Decision Transformer
- Uses the GPT architecture in the offline RL setting.
- Given an appropriate target return, outputs the actions that achieve it.
- Compares well against model-free methods (CQL).

Appendix
- Explanation video by YouTuber Yannic
