6.1 Offline and supervised reinforcement learning
I. Distribution shift in offline RL.
A. Constrain the policy action space.
B. Incorporate value pessimism.
C. Incorporate pessimism into learned dynamics models.
II. Learning wide behavior distributions
A. Learning a task-agnostic set of skills, either with likelihood-based approaches
B. Or by maximizing mutual information
III. Return conditioning / 'supervised RL'
A. Similar to DT; DT benefits from the use of long contexts for behavior modeling and
long-term credit assignment.
❏ There is a lot of work tackling the distribution-shift problem in offline RL!
❏ There is also work that treats reinforcement learning as supervised learning.
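As a concrete instance of value pessimism (approach I.B above), here is a minimal sketch of a CQL-style regularizer for a single state with discrete actions; the function name and the `alpha` weight are illustrative, not from the paper:

```python
import math

def cql_penalty(q_values, data_action, alpha=1.0):
    """CQL-style pessimism term for one state.

    Pushes down Q-values on all actions (via a soft maximum) while
    pushing up the Q-value of the action actually seen in the dataset,
    so out-of-distribution actions end up underestimated.
    """
    soft_max = math.log(sum(math.exp(q) for q in q_values))  # logsumexp over actions
    return alpha * (soft_max - q_values[data_action])

# With uniform Q-values the penalty reduces to alpha * log(num_actions).
print(cql_penalty([0.0, 0.0], data_action=0))  # ≈ 0.693
```

In practice this term is added to the usual TD loss; the sketch only shows the pessimism component.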
6.2 Credit assignment
❏ Rewards need to be attributed to the most important steps; this line of work studies how to distribute that credit.
❏ Experiments suggest that Transformers are well suited to this.
1. Self-Attentional Credit Assignment for Transfer in Reinforcement Learning
2. Hindsight Credit Assignment
3. Counterfactual Credit Assignment in Model-Free Reinforcement Learning
6.3 Conditional language generation
6.4 Attention and transformer models
❏ There is much related work on conditional language generation, Transformers, and attention:
offline RL, sequence modeling, and goal conditioning via reward.
❏ The idea is interesting, so I expect many follow-up works to appear.
❏ Not knowing an appropriate target reward is a problem, so I want to come up with ideas to solve it.
Future work
- Stochastic Decision Transformer
- Conditioning on return distributions to model stochastic settings instead of deterministic returns
- Model-based Decision Transformer.
- Transformer models can also be used to model the state evolution of a trajectory
- For real-world applications
- Augmenting RL.
Decision Transformer
- Uses the GPT architecture in the offline RL setting.
- Given an appropriate target return, it outputs actions that achieve it.
- Performs well compared with a model-free method (CQL).
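The return-conditioned setup above can be sketched in a few lines: compute the return-to-go at each step, then interleave (return, state, action) tokens for a causal transformer. The function names are illustrative; the token layout follows the paper's description:

```python
def returns_to_go(rewards):
    # R_t = sum of rewards from step t to the end (undiscounted),
    # used as the conditioning target at each timestep.
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return rtg[::-1]

def interleave(rtgs, states, actions):
    # Decision Transformer feeds a causal transformer the sequence
    # (R_1, s_1, a_1, R_2, s_2, a_2, ...); at test time the first
    # return token is set to the desired target return.
    tokens = []
    for g, s, a in zip(rtgs, states, actions):
        tokens += [("R", g), ("s", s), ("a", a)]
    return tokens

print(returns_to_go([1.0, 0.0, 2.0]))  # [3.0, 2.0, 2.0]
```

At evaluation time, the achieved reward is subtracted from the conditioning return after each step, which is how "setting an appropriate target return" is operationalized.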