Towards Physically Interpretable World Models:
Meaningful Weakly Supervised Representations for Visual Trajectory Prediction
Zhenjiang Mao, Ivan Ruchkin
Trustworthy Engineered Autonomy (TEA) Lab
Department of Electrical and Computer Engineering, University of Florida
PROBLEM
[Figure: Standard Image Encoder; latent variables v1, v2, …, v32 lack explicit physical interpretation]
Dynamics-based Prediction
Given:
- State set $X$, action set $A$, and controller $h : Y \to A$
- Dynamics function $\phi$ (with parameters $\psi \in \Psi$) : $X \times A \times \Psi \to X$
Challenge 1: Unknown states and parameters of dynamical models
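A minimal sketch of the closed loop described above, using only the Python standard library. Everything concrete here is an assumption for illustration: the two-dimensional state, the parameters `psi = (mass, dt)`, the `observe` function standing in for the camera, and the bang-bang controller. The poster does not specify any of these.

```python
from typing import List, Tuple

State = Tuple[float, float]   # hypothetical state: (position, velocity)
Params = Tuple[float, float]  # hypothetical parameters psi: (mass, time step)


def phi(x: State, a: float, psi: Params) -> State:
    """Toy dynamics phi: X x A x Psi -> X (a point mass pushed by force a)."""
    mass, dt = psi
    pos, vel = x
    return (pos + vel * dt, vel + (a / mass) * dt)


def observe(x: State) -> float:
    """Stand-in for image formation: the observation is just the position."""
    return x[0]


def h(y: float) -> float:
    """Toy controller h: Y -> A that pushes the mass toward the origin."""
    return -1.0 if y > 0 else 1.0


def rollout(x0: State, psi: Params, steps: int) -> List[State]:
    """Alternate observation, control, and dynamics for a fixed number of steps."""
    states = [x0]
    for _ in range(steps):
        action = h(observe(states[-1]))
        states.append(phi(states[-1], action, psi))
    return states


trajectory = rollout(x0=(1.0, 0.0), psi=(1.0, 0.1), steps=3)
```

If the true state and `psi` were known, this rollout would already be a predictor; Challenge 1 is that in practice neither is available from images alone.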
Image-based Prediction
Image forecaster: $y_{t+1} = f_{\mathrm{pred}}(y_t, a_t)$
Latent forecaster:
- Encoding: $z_{t-m:t} = f_{\mathrm{enc}}(y_{t-m:t})$
- Prediction: $z_{t+1:t+n} = f_{\mathrm{pred}}(z_{t-m:t})$
- Decoding: $y_{t+1:t+n} = f_{\mathrm{dec}}(z_{t+1:t+n})$
Challenge 2: Latent representations do not have physical meaning
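The encode, predict, decode pipeline of a latent forecaster can be sketched as follows. This is an illustration of the data flow only: the random linear maps, the toy dimensions, and the helper names `random_matrix` and `matvec` are assumptions, not the networks used in this work.

```python
import random
from typing import List

Vector = List[float]

random.seed(0)
OBS_DIM, LATENT_DIM = 16, 4  # toy sizes (assumed); the encoder figure shows 32 latents
M, N = 2, 3                  # encode frames t-m..t (m+1 frames), predict n future steps


def random_matrix(rows: int, cols: int) -> List[Vector]:
    """Random linear map standing in for a trained network."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


W_ENC = random_matrix(LATENT_DIM, OBS_DIM)
W_PRED = random_matrix(N * LATENT_DIM, (M + 1) * LATENT_DIM)
W_DEC = random_matrix(OBS_DIM, LATENT_DIM)


def f_enc(y_hist: List[Vector]) -> List[Vector]:
    """Encode each past frame: y_{t-m:t} -> z_{t-m:t}."""
    return [matvec(W_ENC, y) for y in y_hist]


def f_pred(z_hist: List[Vector]) -> List[Vector]:
    """Map the stacked latent history to n future latents: z_{t-m:t} -> z_{t+1:t+n}."""
    stacked = [value for z in z_hist for value in z]
    flat = matvec(W_PRED, stacked)
    return [flat[i * LATENT_DIM:(i + 1) * LATENT_DIM] for i in range(N)]


def f_dec(z_future: List[Vector]) -> List[Vector]:
    """Decode each predicted latent back to an observation: z_{t+1:t+n} -> y_{t+1:t+n}."""
    return [matvec(W_DEC, z) for z in z_future]


y_hist = [[random.random() for _ in range(OBS_DIM)] for _ in range(M + 1)]
z_hist = f_enc(y_hist)
z_future = f_pred(z_hist)
y_future = f_dec(z_future)
```

Nothing in this pipeline ties any coordinate of `z_hist` to a physical quantity, which is the gap that Challenge 2 points to.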
Problem Definition
[Figure: Interpretable Image Encoder; latent labels: positions, angular rates, accelerations, orientations]
Given weak supervision signals (intervals of states), minimize:
1. Observation prediction error
2. State prediction error
CONTRIBUTIONS
■ A novel learning architecture designed to encode physically meaningful representations from high-dimensional observations in closed-loop systems.
■ A training pipeline that effectively incorporates weak supervision and accommodates unknown dynamics.
■ Experimental results that demonstrate superior performance in both physical interpretability and predictive accuracy compared to traditional world models.
APPROACH
[Figure: World model vs. dynamical loop]
[Figure: Physically Interpretable World Model (PIWM)]
■ Evidence Lower Bound (ELBO) of a standard VAE:
$\mathrm{ELBO} = \mathbb{E}_{q(z \mid y)}[\log p(y \mid z)] - D_{\mathrm{KL}}\big(q(z \mid y) \,\|\, \mathcal{N}(0, 1)\big)$
● $\mathbb{E}_{q(z \mid y)}[\log p(y \mid z)]$ is the expected log-likelihood.
● $q(z \mid y)$ is the approximate posterior, which approximates the true posterior $p(z \mid y)$.
● The KL divergence quantifies the difference between the two distributions $q(z \mid y)$ and $\mathcal{N}(0, 1)$.
■ Added weak supervision loss as a physically informative prior:
■ Added control action prediction loss (cross-entropy between the control actions predicted on the original images and on the reconstructed images):
■ Transformed weak supervision into a Gaussian for VAE compatibility:
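The loss terms listed above can be illustrated with a small standard-library sketch. The closed-form KL divergence between scalar Gaussians is standard. The rest is assumed for illustration: `interval_to_gaussian` (midpoint as the mean, interval half-width equal to three standard deviations) is one hypothetical way to turn an interval into a Gaussian, and using that Gaussian as the KL target is one plausible reading of "physically informative prior". The formulas on the poster itself are not recoverable from this text.

```python
import math
from typing import List, Sequence, Tuple


def kl_gaussians(mu_q: float, sigma_q: float, mu_p: float, sigma_p: float) -> float:
    """Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for scalar Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


def interval_to_gaussian(low: float, high: float, num_sigmas: float = 3.0) -> Tuple[float, float]:
    """Hypothetical mapping: mean at the midpoint, half-width equal to num_sigmas std."""
    mean = 0.5 * (low + high)
    std = (high - low) / (2.0 * num_sigmas)
    return mean, std


def standard_kl(mus: Sequence[float], sigmas: Sequence[float]) -> float:
    """KL term of the standard VAE ELBO: each latent is compared with N(0, 1)."""
    return sum(kl_gaussians(m, s, 0.0, 1.0) for m, s in zip(mus, sigmas))


def weak_supervision_kl(mus: Sequence[float], sigmas: Sequence[float],
                        intervals: List[Tuple[float, float]]) -> float:
    """Assumed weak supervision term: each latent is compared with its interval's Gaussian."""
    total = 0.0
    for m, s, (low, high) in zip(mus, sigmas, intervals):
        prior_mean, prior_std = interval_to_gaussian(low, high)
        total += kl_gaussians(m, s, prior_mean, prior_std)
    return total


def action_cross_entropy(p_original: Sequence[float], p_reconstructed: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Cross-entropy -sum_i p_i log q_i between two action distributions."""
    return -sum(p * math.log(q + eps) for p, q in zip(p_original, p_reconstructed))


# A latent whose mean lies outside its interval is penalized heavily,
# while one centered in the interval with matching spread costs nothing.
inside = weak_supervision_kl([0.0], [0.1], [(-0.3, 0.3)])
outside = weak_supervision_kl([1.0], [0.1], [(-0.3, 0.3)])
```

Under these assumptions, swapping the $\mathcal{N}(0, 1)$ target for an interval-derived Gaussian is what pulls a latent coordinate toward a physically meaningful range.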
RESULTS
[Figures: Position Error of Cart Pole and Angle Error of Cart Pole over the prediction horizon, each with panels for interval lengths 2.5%, 5%, and 10%]
[Table: MSE Comparison]
[Table: SSIM Comparison]
CONCLUSION
Model Comparison: PIWM w.a., which leverages weak supervision and physical constraints, achieves the best performance, followed by PIWM wo.a.; VAE-WM and β-VAE-WM show significant limitations in capturing dynamic characteristics.
Performance across Interval Lengths: As the interval length increases, the PIWM models maintain low error and remain robust, whereas VAE-WM and β-VAE-WM exhibit significant error growth, revealing their limitations in handling long-horizon tasks.
Summary of MSE Results: PIWM w.a. achieves the lowest MSE across all interval lengths, demonstrating superior accuracy and robustness, while PIWM wo.a., VAE-WM, and β-VAE-WM show comparatively higher errors.
Summary of Structural Similarity Index Measure (SSIM) Results: PIWM w.a. consistently delivers the highest SSIM values, preserving structural integrity across all interval lengths and outperforming PIWM wo.a., VAE-WM, and β-VAE-WM.
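For reference, the two image metrics summarized above can be computed as in the following sketch. It is a simplification under stated assumptions: grayscale images are nested lists with pixel values in [0, 1], and `global_ssim` evaluates the SSIM formula once over the whole image rather than over sliding windows as standard implementations do. The constants 0.01 and 0.03 are the conventional SSIM defaults, not values taken from this work.

```python
from typing import List

Image = List[List[float]]  # grayscale image, pixel values assumed in [0, 1]


def _flatten(img: Image) -> List[float]:
    return [pixel for row in img for pixel in row]


def mse(a: Image, b: Image) -> float:
    """Mean squared error over all pixels."""
    xs, ys = _flatten(a), _flatten(b)
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def global_ssim(a: Image, b: Image, data_range: float = 1.0) -> float:
    """Single-window SSIM computed from whole-image means, variances, and covariance."""
    xs, ys = _flatten(a), _flatten(b)
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mu_x) ** 2 for x in xs) / n
    var_y = sum((y - mu_y) ** 2 for y in ys) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


target = [[0.0, 0.5], [0.5, 1.0]]
prediction = [[0.1, 0.5], [0.4, 1.0]]
frame_mse = mse(target, prediction)
frame_ssim = global_ssim(target, prediction)
```

Lower MSE and higher SSIM both indicate that a predicted frame is closer to the ground-truth frame; SSIM equals 1 only when the two images match.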
FUTURE WORK
- Incorporate finer-grained physical constraints for improved performance.
- Develop adaptive supervision techniques to enhance flexibility.
- Explore varying learning rates for extended prediction horizons.
- Leverage physical invariances, e.g. symmetries and conservation laws, to boost robustness.
- Broaden applicability to diverse systems with varying levels of observability and supervision.
REFERENCES
Kingma D P, Welling M. Auto-encoding variational Bayes, 2013.
Higgins I, Matthey L, Pal A, et al. beta-VAE: Learning basic visual concepts with a constrained variational framework, 2017.
Ha D, Schmidhuber J. Recurrent world models facilitate policy evolution, 2018.
Mao Z, Dai S, Geng Y, et al. Zero-shot safety prediction for autonomous robots with foundation world models, 2024.
