Addressing Optimism Bias in
Sequence Modeling for RL
(SPLT Transformer)
백승언
22 Oct, 2023
Contents
 Introduction
 Limitations in previous offline reinforcement learning
 SPLT Transformer
 Sampling-based planning algorithm
 SPLT Transformer
 Experiments
 Environments
 Results
Introduction
Limitations in previous offline reinforcement learning
 Most prior works in offline RL have focused on the mainly deterministic D4RL benchmarks
and the weakly stochastic Atari benchmarks
 Therefore, there has been limited focus on the difficulties of deploying such methods in largely stochastic
domains such as autonomous driving, transportation, and finance
 Recently, some works have explored leveraging high-capacity sequence models in
sequential decision-making problems
 However, these methods focused on deterministic environments and utilized naïve action selection techniques
• The authors supposed that this could lead to overly aggressive and optimistic behavior
[Figures: return query in Decision Transformer (DT); beam search in Trajectory Transformer (TT)]
SPLT Transformer
Sampling-based planning algorithm
 Process of the conventional speed planning algorithm in an autonomous vehicle
 Sampling-based planning algorithm using a time gap distribution (a toy sketch follows the list below)
• 1. Time gap candidate generation: random sampling and profiling of time gap candidates over the prediction time $t_{pre}$
• 2. Speed trajectory calculation: computing a speed trajectory for each candidate based on vehicle dynamics
• 3. Optimal speed trajectory selection: cost-function based evaluation of the candidate trajectories
 Features related to the preceding vehicle: time gap, relative distance, relative speed
 Features related to the ego-vehicle: speed, acceleration, jerk
[Figure: time gap candidates, the resulting speed trajectories, and the selected optimal speed trajectory, each plotted over time up to $t_{pre}$]
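To illustrate the three-step pipeline above, here is a toy Python sketch, assuming a simple proportional car-following rule and a hypothetical tracking/comfort cost; `rollout_speed`, `cost`, and all gains and limits are illustrative stand-ins, not the actual planner.

```python
import numpy as np

DT = 0.1          # integration step [s]
T_PRE = 5.0       # prediction horizon t_pre [s]
N_CANDIDATES = 32 # number of sampled time-gap candidates

def rollout_speed(ego_v, lead_v, lead_gap, time_gap):
    """Toy follower: accelerate toward keeping `time_gap` seconds
    behind the leader (hypothetical dynamics, not the real model)."""
    vs, gap, v = [], lead_gap, ego_v
    for _ in range(int(T_PRE / DT)):
        desired_gap = v * time_gap
        a = np.clip(0.5 * (gap - desired_gap), -3.0, 2.0)  # P-control on gap error
        v = max(v + a * DT, 0.0)
        gap += (lead_v - v) * DT
        vs.append(v)
    return np.array(vs)

def cost(speeds, v_ref=15.0):
    """Hypothetical cost: speed tracking error plus discomfort (accel, jerk)."""
    acc = np.diff(speeds) / DT
    jerk = np.diff(acc) / DT
    return (np.mean((speeds - v_ref) ** 2)
            + 0.1 * np.mean(acc ** 2)
            + 0.01 * np.mean(jerk ** 2))

# 1. time gap candidate generation (random sampling)
candidates = np.random.uniform(0.8, 2.5, size=N_CANDIDATES)
# 2. speed trajectory calculation for each candidate
trajectories = [rollout_speed(ego_v=12.0, lead_v=10.0, lead_gap=20.0, time_gap=tg)
                for tg in candidates]
# 3. optimal speed trajectory selection by cost
best = min(range(N_CANDIDATES), key=lambda i: cost(trajectories[i]))
print(f"selected time gap: {candidates[best]:.2f} s")
```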
SPLT Transformer (I) – Overview
 Overview of the SPLT Transformer
 Existing offline RL algorithms have generally been applied to deterministic or weakly stochastic environments
that differ greatly from the real world
• D4RL benchmark, Atari benchmark, and so on
 The proposed model is designed with separated transformer-based VAE models for predicting the action,
observation, reward, and discounted return
• Transformer-based encoders encode the transition history for the policy decoder and the world model decoder
• The policy decoder estimates the next action conditioned on the transition history with the action to be predicted excluded
• The world model decoder estimates the observation, reward, and discounted return
 Additionally, they enhanced the planning technique for offline RL as a sequence modeling method to
address optimistic/sub-optimal behavior
• They utilized a sampling-based planning technique that selects the best trajectory from a generated set of
candidate trajectories
 Evaluations demonstrated that the SPLT Transformer outperforms prior methods on self-driving tasks with large
stochasticity, in terms of success ratio and generalization performance
SPLT Transformer (II) – Architecture
 SeParated Latent Trajectory Transformer (SPLT Transformer)
 They designed separated Transformer-based discrete latent variable VAEs to represent the policy and world models (see the sketch below)
[Figure: the architecture of the SPLT Transformer for generating a reconstruction prediction]
 Encoders
• Both the world encoder $q_{\phi_w}$ and the policy encoder $q_{\phi_\pi}$ use the same architecture (a non-masking GPT architecture) and receive the same trajectory $\tau_t^K$
– $\tau_t^K = \{s_t, a_t, s_{t+1}, a_{t+1}, \dots, s_{t+K}, a_{t+K}\}$
• These encoders output an $n_w$- or $n_\pi$-dimensional discrete latent variable, with each dimension having $c$ possible values
– $z_t^w \sim q_{\phi_w}(\cdot \mid \tau_t^K)$, $z_t^w \in \{1, \dots, c\}^{n_w}$
– $z_t^\pi \sim q_{\phi_\pi}(\cdot \mid \tau_t^K)$, $z_t^\pi \in \{1, \dots, c\}^{n_\pi}$
 Policy decoder
• The policy decoder uses a similar input trajectory representation and a causal Transformer
– $\tau_t^{\prime k} = \{s_t, a_t, s_{t+1}, a_{t+1}, \dots, s_{t+k}\}$
• The policy decoder then takes the latent variable $z^\pi$ and outputs the mean of the policy distribution, which is assumed to be an isotropic Gaussian
– $p_{\theta_\pi}(a_{t+k} \mid \tau_t^{\prime k}; z^\pi) := \mathcal{N}(f^\pi(\tau_t^{\prime k}, z^\pi), I)$
 World model decoder
• The world model decoder is very similar to the policy decoder, except that its goal is to estimate
– $p_{\theta_w}(s_{t+k+1} \mid \tau_t^k; z^w)$, $p_{\theta_w}(r_{t+k} \mid \tau_t^k; z^w)$, and $p_{\theta_w}(R_{t+k+1} \mid \tau_t^k; z^w)$, where $\forall k \in [1, K]$
• The world model decoder is likewise represented with a causal Transformer that incorporates its latent variable $z^w$ and outputs unit-variance isotropic Gaussian distributions
– $p_{\theta_w}(\phi_{t+k+1} \mid \tau_t^k; z^w) := \mathcal{N}(f_w^\phi(\tau_t^k, z^w), I)$, $\phi \in \{s, r, R\}$
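To make the separated-VAE structure concrete, below is a minimal PyTorch-style sketch of the two branches: each encoder maps the trajectory to a discrete latent (one categorical per latent dimension, sampled with a straight-through Gumbel-softmax), and each decoder is a causal Transformer conditioned on that latent that predicts Gaussian means. Module sizes, the tokenization, and the straight-through sampling scheme are assumptions for illustration; the authors' actual implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteLatentEncoder(nn.Module):
    """Non-masking Transformer encoder producing an n-dimensional discrete
    latent, each dimension a categorical over c values (q_phi(z | tau))."""
    def __init__(self, token_dim, n_latent, c, d_model=128):
        super().__init__()
        self.embed = nn.Linear(token_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_latent * c)
        self.n_latent, self.c = n_latent, c

    def forward(self, traj):                      # traj: (B, T, token_dim)
        h = self.encoder(self.embed(traj)).mean(dim=1)
        logits = self.head(h).view(-1, self.n_latent, self.c)
        # straight-through Gumbel-softmax: hard one-hot sample, soft gradients
        z = F.gumbel_softmax(logits, tau=1.0, hard=True)
        return z.flatten(1)                       # (B, n_latent * c) code

class ConditionedCausalDecoder(nn.Module):
    """Causal Transformer decoder conditioned on a latent code; predicts
    the mean of a unit-variance isotropic Gaussian at each step."""
    def __init__(self, token_dim, latent_dim, out_dim, d_model=128):
        super().__init__()
        self.embed = nn.Linear(token_dim + latent_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, out_dim)

    def forward(self, traj, z):                   # traj: (B, T, token_dim)
        zt = z.unsqueeze(1).expand(-1, traj.size(1), -1)
        x = self.embed(torch.cat([traj, zt], dim=-1))
        mask = nn.Transformer.generate_square_subsequent_mask(traj.size(1))
        h = self.decoder(x, mask=mask)
        return self.head(h)                       # Gaussian means per step

# Hypothetical wiring: the policy branch predicts actions; the world branch
# predicts (next state, reward, return-to-go), i.e. out_dim = state_dim + 2.
# enc_pi = DiscreteLatentEncoder(token_dim=obs_dim + act_dim, n_latent=4, c=2)
# dec_pi = ConditionedCausalDecoder(token_dim=obs_dim + act_dim,
#                                   latent_dim=4 * 2, out_dim=act_dim)
```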
SPLT Transformer (III) – Planning
 Candidate trajectory generation
 The goal of this phase is to predict a possible continuation of the trajectory over the planning horizon $h$, given the current state $s_t$ and the stored history of the last $k$ steps of the trajectory
• $\tau_{t-k}^{k+h} = (s_{t-k}, a_{t-k}, \dots, s_t, a_t, s_{t+1}, \dots, s_{t+h}, a_{t+h})$
 The authors alternately make autoregressive predictions from the policy and world models to predict these quantities (see the sketch below)
• $a_{t+i} = f^\pi((s_{t-k}, a_{t-k}, \dots, s_t, a_t, \dots, s_{t+i}), z^\pi)$
• $\to\; s_{t+i+1}, r_{t+i}, R_{t+i+1} = f^w((s_{t-k}, a_{t-k}, \dots, s_t, a_t, \dots, s_{t+i}, a_{t+i}), z^w)$
 They repeat this alternating procedure until reaching the horizon length $h$, then compute $\tau_t^h$ and its corresponding return $R(\tau_t^h)$
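A minimal sketch of this alternating rollout, assuming `policy_decoder` and `world_decoder` wrap the trained decoders $f^\pi$ and $f^w$ and return only the predicted Gaussian means; the names, signatures, and the scalar return estimate are illustrative.

```python
def rollout(history_states, history_actions, z_pi, z_w, h,
            policy_decoder, world_decoder):
    """Alternate f_pi (next action) and f_w (next state, reward,
    return-to-go) for h steps under fixed latents (z_pi, z_w)."""
    states, actions = list(history_states), list(history_actions)
    rewards, ret_to_go = [], 0.0
    for _ in range(h):
        a = policy_decoder(states, actions, z_pi)        # a_{t+i}
        actions.append(a)
        s_next, r, R = world_decoder(states, actions, z_w)
        states.append(s_next)                            # s_{t+i+1}
        rewards.append(r)                                # r_{t+i}
        ret_to_go = R                                    # R_{t+i+1}
    # trajectory value: summed predicted rewards plus terminal return-to-go
    return sum(rewards) + ret_to_go
```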
 Action selection
 Thanks to the discrete latent variables, SPLT can enumerate all possible combinations of $z^\pi$ and $z^w$
• In the action selection phase, only 256 combinations of latent variables ($c = 2$, $n_w \le 4$, and $n_\pi \le 4$) need to be considered
 Among these trajectories, the authors select the best trajectory, corresponding to
• $\max_i \min_j R_{ij}$, where $i \in [1, c^{n_\pi}]$ and $j \in [1, c^{n_w}]$
• The intuition behind this procedure is that SPLT tries to pick a policy to follow that will be robust to any realistic possible future in the current environment (see the sketch below)
 They execute the first action of $\tau_{i^* j^*}$ and repeat this procedure at every timestep
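A sketch of the robust max-min selection over the enumerated latent codes, reusing the hypothetical `rollout` helper above; with $c = 2$ and $n_\pi = n_w = 4$ this is a 16×16 grid of rollouts.

```python
def select_action(history_states, history_actions, h,
                  policy_decoder, world_decoder,
                  policy_latents, world_latents):
    """Enumerate all (z_pi, z_w) pairs, roll each out, and pick the
    policy latent whose worst-case world-latent return is largest."""
    best_i, best_worst = None, float("-inf")
    for i, z_pi in enumerate(policy_latents):            # i in [1, c^{n_pi}]
        worst = min(rollout(history_states, history_actions, z_pi, z_w, h,
                            policy_decoder, world_decoder)
                    for z_w in world_latents)            # j in [1, c^{n_w}]
        if worst > best_worst:                           # max_i min_j R_ij
            best_i, best_worst = i, worst
    # execute only the first action of the selected trajectory tau_{i*j*}
    z_star = policy_latents[best_i]
    return policy_decoder(history_states, history_actions, z_star)
```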
Experiments
Experiment environment
 Illustrative example: toy autonomous driving problem
 Vehicle control problem in a car-following situation
• Half of the time, the leading vehicle begins hard-braking at the last possible moment (about 70 m)
• The other half of the time, the leading vehicle immediately speeds up to the maximum speed
• It is assumed that the perception and localization systems are well-built
 Environment
 Collected dataset (~100,000 steps) with a distribution of different IDM controllers in a simulation environment
• NoCrash env (based on CARLA)
 Benchmark dataset
• D4RL
– HalfCheetah
– Hopper
– Walker2d
[Figures: the simulation scenario (speed of the preceding vehicle over time), the D4RL benchmark, and the NoCrash RL env in the CARLA simulation environment]
Experiment results (I)
 Comparison with previous methods on offline RL tasks
 Experiments showed comparable performance to previous SOTA methods on the D4RL offline RL
benchmark (HalfCheetah, Hopper, Walker2d)
• Imitation learning: Behavior Cloning (BC)
• Offline RL: Model-Based Offline Planning (MBOP), Conservative Q-Learning (CQL), DT, TT
• Model-free RL: Implicit Q-Learning (IQL)
 The authors describe that the reason for the low performance in the med-replay dataset setting was that
the dataset contains a limited number of temporally consistent behaviors
[Table: offline RL results]
Experiment results (II) – Learning behavior for self-driving vehicle
 Qualitative analysis in a complex stochastic task
 Decision Transformer and Trajectory Transformer underperformed
• For DT, the authors found that conditioning on the maximum return in the dataset leads to crashes every time the
leading vehicle brakes
• For TT, they found that the results depend heavily on the scope of the search used
 SPLT Transformer achieved significant results
• They insist that their world VAE was able to predict both possible modes of the leading vehicle's behavior, and the
policy VAE seems able to predict a range of different trailing behaviors
[Figure: return query in DT, beam search in TT, and best trajectory selection in SPLT, each shown as transitions over time]
Experiment results (III) – Learning behavior for self-driving vehicle
 Quantitative results
 Experiments showed comparable performance to previous SOTA methods
• DT(m): DT conditioned on the maximum return in the dataset
• DT(e): DT conditioned on the expected return of the best controller
• DT(t): DT with a hand-tuned conditional return
• TT(a): TT with more aggressive search parameters
• IDM(t): the best controller from the distribution used to collect the data
 They also evaluated the methods on unseen routes
• SPLT outperformed the previous offline RL methods in the complex env
• SPLT underperformed compared with IQL
[Tables: training results and evaluation results on unseen routes]
Thank you!
Q&A