1
Addressing Optimism Bias in
Sequence Modeling for RL
(SPLT Transformer)
백승언
22 Oct, 2023
2
 Introduction
 Limitations in previous offline reinforcement learning
 SPLT Transformer
 Sampling-based planning algorithm
 SPLT Transformer
 Experiments
 Environments
 Results
Contents
3
Introduction
4
 Most prior works in offline RL have focused on the mainly deterministic D4RL benchmarks
and the weakly stochastic Atari benchmarks
 Therefore, there has been limited focus on the difficulties of deploying such methods in largely stochastic
domains such as autonomous driving, transportation, and finance
 Recently, some works have explored leveraging high-capacity sequence models in
sequential decision-making problems.
 However, these methods focused on deterministic environments and used naïve action selection techniques.
• The authors argue that this can lead to overly aggressive and optimistic behavior
Limitations in previous offline reinforcement learning
Return query in Decision Transformer(DT) Beam search in Trajectory Transformer(TT)
5
SPLT Transformer
6
 Process of the conventional speed planning algorithm in autonomous vehicle
 Sampling-based planning algorithm using time gap distribution
Sampling-based planning algorithm
[Diagram: three-stage speed planning pipeline over the prediction time $t_{pre}$]
 Inputs
• Features related to the preceding vehicle: time gap, relative distance, relative speed
• Features related to the ego vehicle: speed, acceleration, jerk
 Time gap candidate generation: random sampling and profiling (time gap [s] over time [s])
 Speed trajectory calculation: calculated from vehicle dynamics (speed [km/h] over time [s])
 Optimal speed trajectory selection: cost-function-based evaluation
($t_{pre}$: prediction time)
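The three-stage pipeline above can be sketched in code. This is a minimal toy version, not the deck's actual implementation: the dynamics model, control gains, cost weights, and sampling ranges are all hypothetical placeholders.

```python
import random

def plan_speed(ego_speed, lead_speed, gap, n_candidates=32, t_pre=5.0, dt=0.5):
    """Sketch: sample time-gap candidates, roll out a speed trajectory for each
    with a toy dynamics model, and keep the lowest-cost trajectory."""
    best_traj, best_cost = None, float("inf")
    for _ in range(n_candidates):
        # (1) Time gap candidate generation: random sampling
        target_gap = random.uniform(1.0, 3.0)             # desired time gap [s]
        # (2) Speed trajectory calculation with simplified vehicle dynamics
        v, g, traj, cost = ego_speed, gap, [], 0.0
        for _ in range(int(t_pre / dt)):
            desired = target_gap * max(v, 1.0)            # desired distance [m]
            a = max(-3.0, min(2.0, 0.2 * (g - desired)))  # bounded accel [m/s^2]
            v = max(0.0, v + a * dt)
            g += (lead_speed - v) * dt
            traj.append(v)
            # (3) Cost-function-based evaluation: comfort + gap-tracking terms
            cost += a * a + 0.1 * (g - desired) ** 2
        if g > 0.0 and cost < best_cost:                  # discard crashing rollouts
            best_traj, best_cost = traj, cost
    return best_traj

random.seed(0)
best = plan_speed(ego_speed=20.0, lead_speed=15.0, gap=50.0)
```

The key design point this illustrates is that stochasticity is handled by sampling many candidate plans and scoring each one, rather than committing to a single optimistic rollout.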
7
 Overview of the SPLT Transformer
 Existing offline RL algorithms have generally been applied to deterministic or weakly stochastic environments that differ greatly from the real world
• D4RL benchmark, Atari benchmark, and so on
 The proposed model consists of separated Transformer-based VAE models for predicting the action, observation, reward, and discounted return
• Transformer-based encoders encode the transition history for the policy decoder and the world model decoder
• The policy decoder estimates the next action conditioned on the transition history with the final action excluded
• The world model decoder estimates the observation, reward, and discounted return
 Additionally, they enhance the planning technique for offline RL as sequence modeling to address optimistic/sub-optimal behavior
• They use a sampling-based planning technique that selects the best trajectory from a generated set of candidate trajectories
 Evaluation demonstrates that the SPLT Transformer outperforms prior methods, in terms of success rate and generalization, on self-driving tasks with large stochasticity
SPLT Transformer (I) – Overview
8
 SeParated Latent Trajectory Transformer (SPLT Transformer)
 They designed separated Transformer-based discrete latent-variable VAEs to represent the policy and world models
SPLT Transformer (II) – Architecture
The architecture of SPLT Transformer for generating a reconstruction prediction
 Encoders
• Both the world encoder $q_{\phi_w}$ and the policy encoder $q_{\phi_\pi}$ use the same architecture (a non-masking GPT architecture) and receive the same trajectory $\tau_t^K$
– $\tau_t^K = \{s_t, a_t, s_{t+1}, a_{t+1}, \dots, s_{t+K}, a_{t+K}\}$
• These encoders output an $n_w$- or $n_\pi$-dimensional discrete latent variable, with each dimension taking one of $c$ possible values
– $z_t^w \sim q_{\phi_w}(\cdot \mid \tau_t^K),\; z_t^w \in \{1, \dots, c\}^{n_w}$
– $z_t^\pi \sim q_{\phi_\pi}(\cdot \mid \tau_t^K),\; z_t^\pi \in \{1, \dots, c\}^{n_\pi}$
 Policy decoder
• The policy decoder uses a similar input trajectory representation and a causal Transformer
– $\tau_t^{\prime k} = \{s_t, a_t, s_{t+1}, a_{t+1}, \dots, s_{t+k}\}$
• It additionally takes the latent variable $z^\pi$ and outputs the mean of the policy distribution, which is assumed to be an isotropic Gaussian
– $p_{\theta_\pi}(a_{t+k} \mid \tau_t^{\prime k}; z^\pi) := \mathcal{N}(f^\pi(\tau_t^{\prime k}, z^\pi), I)$
 World model decoder
• The world model decoder is very similar to the policy decoder, except that its goal is to estimate
– $p_{\theta_w}(s_{t+k+1} \mid \tau_t^k; z^w)$, $p_{\theta_w}(r_{t+k} \mid \tau_t^k; z^w)$, and $p_{\theta_w}(R_{t+k+1} \mid \tau_t^k; z^w)$, where $\forall k \in [1, K]$
• It is likewise represented with a causal Transformer that incorporates its latent variable $z^w$ and outputs unit-variance isotropic Gaussian distributions
– $p_{\theta_w}(\phi_{t+k+1} \mid \tau_t^k; z^w) := \mathcal{N}(f_w^\phi(\tau_t^k, z^w), I),\; \phi \in \{s, r, R\}$
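As a concrete check on the sizes the discrete latents imply (assuming $c = 2$ and 4-dimensional latents, the largest configuration the planning section mentions), the latent spaces can be enumerated directly:

```python
from itertools import product

def latent_space(c, n):
    """Enumerate an n-dimensional discrete latent with c values per dimension."""
    return list(product(range(c), repeat=n))

z_pi_space = latent_space(2, 4)             # 2**4 = 16 policy latents
z_w_space = latent_space(2, 4)              # 2**4 = 16 world-model latents
n_pairs = len(z_pi_space) * len(z_w_space)  # 256 (z_pi, z_w) pairs per step
```

This exhaustive enumeration is what makes the planning phase tractable without beam search or random shooting.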
9
 Candidate trajectory generation
 The goal of this phase is to predict possible continuations of the trajectory over the planning horizon $h$, given the current state $s_t$ and a stored history of the last $k$ steps
• $\tau_{t-k}^{k+h} = (s_{t-k}, a_{t-k}, \dots, s_t, a_t, s_{t+1}, \dots, s_{t+h}, a_{t+h})$
 The authors alternately make autoregressive predictions from the policy and world models to predict these quantities
• $a_{t+i} = f^\pi((s_{t-k}, a_{t-k}, \dots, s_t, a_t, \dots, s_{t+i}), z^\pi)$
→ $(s_{t+i+1}, r_{t+i}, R_{t+i+1}) = f^w((s_{t-k}, a_{t-k}, \dots, s_t, a_t, \dots, s_{t+i}, a_{t+i}), z^w)$
 They repeat this alternating procedure until reaching the horizon length $h$ and compute $\tau_t^h$ and its corresponding return estimate $R(\tau_t^h)$
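The alternating rollout can be sketched as follows. `policy_fn` and `world_fn` stand in for the learned decoder means $f^\pi$ and $f^w$; the stubs used here are hypothetical toys, not the paper's trained models.

```python
def rollout(states, actions, z_pi, z_w, policy_fn, world_fn, h):
    """Alternate between the policy decoder and the world-model decoder for h
    steps; `states` has one more element than `actions` (the current state)."""
    states, actions = list(states), list(actions)
    R = 0.0
    for _ in range(h):
        a = policy_fn(states, actions, z_pi)             # a_{t+i} = f_pi(tau', z_pi)
        actions.append(a)
        s_next, r, R = world_fn(states, actions, z_w)    # (s, r, R) = f_w(tau, z_w)
        states.append(s_next)
    return states, actions, R

# Toy deterministic stubs standing in for the decoder means:
policy_fn = lambda s, a, z: z[0]                         # constant action from latent
world_fn = lambda s, a, z: (s[-1] + a[-1], a[-1], sum(a))  # additive toy dynamics
states, actions, R = rollout([0.0], [], (1.0,), (0,), policy_fn, world_fn, h=3)
```

Because both latents are fixed for the whole rollout, each $(z^\pi, z^w)$ pair yields one internally consistent candidate trajectory rather than a fresh sample per step.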
 Action selection
 Thanks to the discrete latent variables, SPLT can enumerate all possible combinations of $z^\pi$ and $z^w$
• In the action selection phase, only 256 combinations of latent variables need to be considered ($c = 2$, $n_w \le 4$, and $n_\pi \le 4$)
 Among these trajectories, the authors select the best trajectory corresponding to
• $\max_i \min_j R_{ij}$, where $i \in [1, c^{n_\pi}]$ and $j \in [1, c^{n_w}]$
• The intuition behind this procedure is that SPLT tries to pick a policy to follow that will be robust to any realistic possible future in the current environment
 They execute the first action of $\tau_{i^* j^*}$ and repeat this procedure at every timestep
SPLT Transformer (III) – Planning
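The robust max-min selection can be sketched as follows. `evaluate` stands in for rolling out the policy and world models with a fixed latent pair and returning the return estimate $R_{ij}$; the linear scoring function used below is a hypothetical placeholder for demonstration.

```python
from itertools import product

def select_policy_latent(latents_pi, latents_w, evaluate):
    """Robust action selection: pick the z_pi attaining max_i min_j R_ij,
    where evaluate(z_pi, z_w) returns the rollout return for that pair."""
    best_z, best_worst = None, float("-inf")
    for z_pi in latents_pi:
        # Score each candidate policy by its worst case over world latents
        worst = min(evaluate(z_pi, z_w) for z_w in latents_w)
        if worst > best_worst:
            best_z, best_worst = z_pi, worst
    return best_z, best_worst

# With c = 2 and 4-dimensional latents, 16 x 16 = 256 pairs are scored:
latents = list(product(range(2), repeat=4))
z_star, value = select_policy_latent(
    latents, latents, evaluate=lambda zp, zw: sum(zp) - sum(zw))
```

Taking the minimum over world latents is exactly what counters the optimism bias of return-conditioning or beam search: a policy only scores well if it performs well under every plausible future the world model can represent.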
10
Experiments
11
 Illustrative example: toy autonomous driving problem
 Vehicle control problem in a car-following situation
• Half of the time, the leading vehicle begins hard-braking at the last possible moment (about 70 m)
• The other half of the time, the leading vehicle immediately speeds up to the maximum speed
• The perception and localization systems are assumed to work well
 Environment
 Collected a dataset (~100,000 steps) with a distribution of different IDM controllers in a simulation environment
• NoCrash environment (based on CARLA)
 Benchmark dataset
• D4RL
– HalfCheetah
– Hopper
– Walker2d
Experiment environment
[Figures: simulation scenario (speed [km/h] over time [s], with the preceding vehicle), the D4RL environments, and the NoCrash RL environment in the CARLA simulator]
12
 Comparison with previous methods on offline RL tasks
 Experiments showed performance comparable to previous SOTA methods on the D4RL offline RL benchmark (HalfCheetah, Hopper, Walker2d)
• Imitation learning: Behavior Cloning (BC)
• Offline RL: Model-Based Offline Planning (MBOP), Conservative Q-Learning (CQL), DT, TT
• Model-free RL: Implicit Q-Learning (IQL)
 The authors attribute the low performance in the medium-replay dataset setting to the dataset containing a limited number of temporally consistent behaviors
Experiment results (I)
Offline RL results in tabular form
13
Experiment results (II) – Learning behavior for self-driving vehicle
 Qualitative analysis on a complex stochastic task
 Decision Transformer and Trajectory Transformer underperform
• For DT, the authors found that conditioning on the maximum return in the dataset leads to crashes every time the
leading vehicle brakes
• For TT, they found that the results depend heavily on the scope of the search used
 The SPLT Transformer achieves significantly better results
• They argue that the world VAE is able to predict both possible modes of the leading vehicle's behavior, and the
policy VAE appears to predict a range of different trailing behaviors
[Figures: return query in DT, beam search in TT, and best trajectory selection in SPLT]
14
Experiment results (III) – Learning behavior for self-driving vehicle
 Quantitative results
 Experiments showed performance comparable to previous SOTA methods
• DT(m): DT conditioned on the maximum return in the dataset
• DT(e): DT conditioned on the expected return of the best controller
• DT(t): DT with a hand-tuned conditional return
• TT(a): TT with more aggressive search parameters
• IDM(t): the best controller from the distribution used to collect the data
 They also evaluate the methods on unseen routes
• SPLT outperformed the previous offline RL methods in the complex environment
• SPLT underperformed compared with IQL
Evaluation results on unseen routes
Training results in tabular form
15
Thank you!
16
Q&A
Editor's Notes
1. Now I will begin the presentation on the algorithm of the paper I selected.