Trajectory Transformer:
Reinforcement Learning as One Big
Sequence Modeling Problem
백승언
13 Aug, 2023
 Introduction
 Sequential Neural Network in Reinforcement Learning
 Trajectory Transformer
 Overview
 Model specification
 Planning technique based on tasks
 Experiments
 Environments and Dataset
 Results
Contents
Introduction
 Various models have utilized sequential neural networks such as LSTMs, Seq2Seq models, and
Transformer architectures
 Policy: ALD(Transformer), …
 Value: DRQN(LSTM), FRMQN(memory network), …
 Transition model: Dreamer(LSTM), TransDreamer(Transformer), …
 Multi-agent RL: QMIX(GRU), AlphaStar(LSTM), …
 While previous works demonstrated the importance of such models for representing memory, they
still relied on standard RL algorithmic advances to improve performance
 The goal of the Trajectory Transformer is different: the authors aim to replace as much of the RL pipeline as
possible with sequence modeling
Sequential Neural Network in Reinforcement Learning
DRQN architecture Dreamer / TransDreamer architecture QMIX architecture
Trajectory Transformer
 Overview of the Trajectory Transformer
 Previous model-free, model-based, and offline RL algorithms require the following components
• Model-free algorithms: critic, actor (optional)
• Model-based algorithms: dynamics model (optional), critic (optional), actor (optional)
• Offline RL algorithms: dynamics model (optional), critic (optional), behavior constraints
 However, the proposed Transformer-based model unifies all of these components under a single sequence model
• The advantage of this perspective is that high-capacity sequence model architectures can be brought to bear on the problem,
resulting in an approach that could benefit from the scalability underlying large-scale learning results
 Additionally, the proposed model can handle various tasks, including imitation learning, goal-reaching, and offline RL,
with simple modifications to the same decoding procedure
• Their results suggest that the algorithms and architectural motifs that have been widely applicable in unsupervised learning carry
similar benefits in reinforcement learning
 They proposed a Transformer-based model for predicting observations, actions, and rewards, and optimized the
objective function as follows
• $\mathcal{L}(\tau) = \sum_{t=0}^{T-1} \Big( \sum_{i=0}^{N-1} \log P_\theta\big(\mathbf{s}_t^i \mid \mathbf{s}_t^{<i}, \tau_{<t}\big) + \sum_{j=0}^{M-1} \log P_\theta\big(\mathbf{a}_t^j \mid \mathbf{a}_t^{<j}, \tau_{<t}\big) + \log P_\theta\big(r_t \mid \mathbf{a}_t, \mathbf{s}_t, \tau_{<t}\big) \Big)$
 Evaluations demonstrated that the Trajectory Transformer achieves strong performance on imitation learning,
goal-reaching, and offline RL tasks
Overview
 Trajectory data as an unstructured sequence for modeling by a Transformer architecture
 A trajectory $\tau$ consists of $N$-dimensional states, $M$-dimensional actions, and scalar rewards
• $\tau = \big(s_t^0, s_t^1, \ldots, s_t^{N-1},\; a_t^0, a_t^1, \ldots, a_t^{M-1},\; r_t\big)_{t=0}^{T-1}$, where the superscript denotes the dimension index and $t$ denotes the timestep
 Discretizing inputs(tokenization)
• They investigated two simple discretization approaches
– Uniform: Assuming a per-dimension vocabulary size of $V$, the tokens for state dimension $i$ cover uniformly-spaced
intervals of width $(\max s^i - \min s^i)/V$
– Quantile: All tokens for a given dimension account for an equal amount of probability mass under the empirical data
distribution; each token accounts for 1 out of every $V$ data points in the training set (see the sketch below)
Model specification (I)
Architecture of the Trajectory Transformer
 Using a Transformer decoder mirroring the GPT architecture
• They used a small-scale model consisting of four
layers and six self-attention heads
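As a concrete illustration of the tokenization described above, here is a minimal Python sketch of the two discretization schemes and of flattening a trajectory into a single token stream. All names are illustrative rather than the authors' code, and the vocabulary size `V = 100` is only an example value.

```python
import numpy as np

V = 100  # example per-dimension vocabulary size (illustrative)

def uniform_tokenize(x, lo, hi):
    """Uniform scheme: V equal-width bins spanning [lo, hi] for one dimension."""
    frac = np.clip((x - lo) / (hi - lo + 1e-8), 0.0, 1.0 - 1e-8)
    return (frac * V).astype(np.int64)

def quantile_tokenize(x, sorted_train_values):
    """Quantile scheme: each token covers 1/V of the empirical data mass
    for one dimension (sorted_train_values holds that dimension's data)."""
    ranks = np.searchsorted(sorted_train_values, x) / len(sorted_train_values)
    return np.minimum((ranks * V).astype(np.int64), V - 1)

def flatten_trajectory(state_tokens, action_tokens, reward_tokens):
    """Interleave tokens as (s_t^0..s_t^{N-1}, a_t^0..a_t^{M-1}, r_t) per step."""
    per_step = np.concatenate(
        [state_tokens, action_tokens, reward_tokens[:, None]], axis=1)
    return per_step.reshape(-1)  # one token sequence of length T * (N + M + 1)
```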
 Objective function
 The authors optimized the following objective, which is the standard teacher-forcing procedure used to
train autoregressive sequence models
• $\mathcal{L}(\tau) = \sum_{t=0}^{T-1} \Big( \sum_{i=0}^{N-1} \log P_\theta\big(\mathbf{s}_t^i \mid \mathbf{s}_t^{<i}, \tau_{<t}\big) + \sum_{j=0}^{M-1} \log P_\theta\big(\mathbf{a}_t^j \mid \mathbf{a}_t^{<j}, \tau_{<t}\big) + \log P_\theta\big(r_t \mid \mathbf{a}_t, \mathbf{s}_t, \tau_{<t}\big) \Big),$
• in which $\tau_{<t}$ is shorthand for the tokenized trajectory from timesteps $0$ through $t-1$
 Prediction horizon
 Due to the quadratic complexity of self-attention, they limited the maximum number of conditioning tokens
to 512, corresponding to a horizon of $\lfloor 512/(N+M+1) \rfloor$ transitions (see the sketch below)
Model specification (II)
Figure: Horizon of the Trajectory Transformer. A causal Transformer conditions on up to 512 tokens, with $N+M+1$ tokens per transition $(\mathbf{s}_t, \mathbf{a}_t, r_t)$.
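For clarity, here is a minimal PyTorch sketch of the teacher-forcing objective and the horizon computation above. It assumes `model` is any GPT-style causal Transformer mapping token ids of shape (B, L) to logits of shape (B, L, vocab_size); the names are illustrative, not the authors' API.

```python
import torch
import torch.nn.functional as F

def trajectory_nll(model, tokens):
    """Average next-token negative log-likelihood over tokenized trajectories."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift targets by one
    logits = model(inputs)                            # (B, L-1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

# 512 conditioning tokens and N+M+1 tokens per transition => planning horizon
N, M = 17, 6                  # illustrative state/action dimensions
horizon = 512 // (N + M + 1)  # = 21 transitions for these dimensions
```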
Planning technique based on tasks (I)
 Beam search (BS)
 A heuristic technique for generating sequences, widely used in NLP tasks
• BS keeps the K sequences with the highest log-probability at each decoding step (see the sketch below)
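A minimal beam-search sketch, illustrative rather than the paper's implementation: it keeps the K highest-scoring sequences at every decoding step, assuming `model` maps token ids (K, L) to logits (K, L, vocab_size).

```python
import torch

@torch.no_grad()
def beam_search(model, context, steps, K):
    """Grow K beams by total log-probability; return the best one."""
    beams = context.repeat(K, 1)              # (K, L), all beams share context
    scores = torch.zeros(K)
    scores[1:] = float("-inf")                # start with a single live beam
    for _ in range(steps):
        logp = torch.log_softmax(model(beams)[:, -1], dim=-1)  # (K, vocab)
        cand = scores[:, None] + logp         # score of every 1-token extension
        scores, flat = cand.view(-1).topk(K)  # best K extensions overall
        beam_idx = flat // logp.size(-1)
        token = (flat % logp.size(-1))[:, None]
        beams = torch.cat([beams[beam_idx], token], dim=1)
    return beams[scores.argmax()]
```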
 Planning according to tasks
 For imitation learning
• This situation matches the goal of sequence modeling exactly
– They used BS without modification, setting the conditioning input $\mathbf{x}$ to the
current state $s_t$
Example of beam search (K = 3)
 For goal-conditioned RL
• The proposed Transformer architecture features a "causal" attention mask to ensure that predictions depend only on
previous tokens in the sequence
• However, for goal-conditioned RL, they additionally condition on the goal (final) state $\mathbf{s}_T$
– They decode trajectories with probabilities of the form $P_\theta\big(\mathbf{s}_t^i \mid \mathbf{s}_t^{<i}, \tau_{<t}, \mathbf{s}_T\big)$
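Under the decoding distribution above, one simple way to realize goal conditioning with a causal mask is to place the goal-state tokens at the front of the context so that every later prediction can attend to them. The sketch below uses hypothetical names and reuses `beam_search` from the previous sketch.

```python
import torch

def goal_conditioned_plan(model, goal_tokens, history_tokens, steps, K):
    """Decode P(s_t^i | s_t^{<i}, tau_{<t}, s_T) by prepending goal tokens."""
    context = torch.cat([goal_tokens, history_tokens], dim=1)  # (1, |s_T| + L)
    return beam_search(model, context, steps, K)
```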
Planning technique based on tasks (II)
 Return-to-go (reward-to-go)
 The discounted sum of rewards from time $t$ to the end of the trajectory
• $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$
 Planning according to tasks
 For offline RL
• By replacing the log-probabilities of transitions with the predicted reward signal, they can use the same
Trajectory Transformer and search strategy for reward-maximizing behavior
• However, using beam search as a reward-maximizing procedure risks producing myopic behavior
– To address this issue, they augment each transition in the training trajectories with the return-to-go $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ and
include it as an additional quantity, discretized identically to the others
• Specifically, they sample full transitions $(\mathbf{s}_t, \mathbf{a}_t, r_t, R_t)$ using likelihood-maximizing beam search, treat these
transitions as the vocabulary, and filter sequences of transitions for those with the highest cumulative reward plus
final return-to-go estimate (see the sketch below)
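The sketch below illustrates the two ingredients just described, with illustrative names and an example discount factor: computing return-to-go labels for training, and scoring a candidate beam by cumulative reward plus the final return-to-go estimate instead of log-probability.

```python
import numpy as np

def returns_to_go(rewards, gamma=0.99):
    """R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}, via a backward pass."""
    R = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

def beam_score(sampled_rewards, final_return_to_go, gamma=0.99):
    """Cumulative discounted reward over the planned transitions plus the
    model's return-to-go estimate at the last planned step."""
    H = len(sampled_rewards)
    discounts = gamma ** np.arange(H)
    return float((discounts * np.asarray(sampled_rewards)).sum()
                 + gamma ** H * final_return_to_go)
```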
Experiments
 Environment
 Four rooms environment
• An environment in which the agent must navigate a maze composed of four rooms
• To obtain a reward, the agent must reach the goal square; both the agent and the goal square are randomly placed in any of the four
rooms
 MuJoCo environment
• A suite of diverse continuous control tasks
– HalfCheetah, Humanoid, Walker, and so on
 D4RL: Datasets for Deep Data-Driven RL
 D4RL is a collection of datasets over well-known environments for offline RL
• Maze2D, AntMaze, MuJoCo, and so on
 Expertise levels in D4RL MuJoCo
• Medium: generated by first training a policy online using SAC, then collecting 1M samples from it
• Medium-replay (Mixed): consists of all samples recorded in the replay buffer during training, until the policy reaches medium-level
performance
• Medium-expert: mixes equal amounts of expert demonstrations and suboptimal data, the latter generated via a partially trained policy or by
unrolling a uniform-at-random policy
Environments and Dataset
Four rooms env | MuJoCo env
 Model prediction results
 Experiments show better trajectory prediction performance compared to a prior state-of-the-art planning model
• The trajectories (100 steps) generated by the proposed model are visually indistinguishable from those in the original dataset,
while for the single-step model, compounding errors lead to implausible trajectory predictions
 The authors also compared the proposed model (causal Transformer) against a Markovian (single-step) Transformer
• In a fully observable setting and in a partially observable setting (50% of states randomly masked)
– The proposed model demonstrated marginally superior accuracy in the partially observable setting compared to the Markovian
Transformer
Experimental results (I)
Generated trajectories in Humanoid | Accuracies of generated trajectories in Humanoid
 Analysis of attention pattern
 The authors reported two distinct attention
patterns during trajectory prediction (in Hopper)
• Left: both states and actions depend
primarily on the immediately preceding transition
– Markov property
• Right: surprisingly, actions rely more on past
actions than they do on past states
Experimental results (II)
 Results on the imitation learning task
 The proposed model achieves average
normalized returns of 104% and 109% in the Hopper and
Walker2d environments, respectively
 Results on the goal-reaching task
 The proposed model accomplished the goal-
reaching task with no reward shaping or reward signal
• The figures below show the generated
trajectories in the four rooms environment
Attention patterns (first- and third-layer attention heads) in Hopper | Trajectories of the goal-reaching task collected by TTO (legend: starting state, goal state)
Experimental results (III)
 Comparison with previous methods on offline RL tasks
 Experiments showed performance comparable to previous state-of-the-art methods on the D4RL offline RL
benchmark (HalfCheetah, Hopper, Walker2d)
• CQL (Model-free RL)
• MOPO (Model-based RL)
• MBOP (Model-based planning)
• BC (Behavior cloning)
• TTO (Proposed model)
 The authors conjectured that the low performance on the HalfCheetah med-expert setting was because the
return discretization was not fine-grained enough, given how rapidly the performance of the expert data
improves
Offline RL results in tabular form
Offline RL results
Thank you!
Q&A
Editor's Notes
1. Now, let me begin the presentation proper on the algorithm of the paper I selected.