DARMDN: Deep autoregressive mixture density nets for dynamical system modelling

HUAWEI TECHNOLOGIES CO., LTD.
www.huawei.com
— Balazs Kegl, Gabriel Hurtado, Albert Thomas
for Noah's Ark Research Lab, Paris

Objective
Develop neural simulators trained on short system logs
B. Kegl / Huawei Research France

Why?
• Automate engineering systems
› Data center cooling
› Wireless parameter tuning
› Wi-Fi setup
• Predictive maintenance
› Copper and optical end-user devices
› Wireless network devices
› Data center servers
• We believe these are only the tip of the iceberg

AI: Highly visible breakthroughs

Why aren't these algorithms already in engineering systems?

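Why is it hard?
• Physical systems do not get faster with time
• System access is tightly controlled by engineers whose responsibility is to keep the systems running
[Diagram: BU, Engineer, and System in a control loop; the action $\boldsymbol{a}_t$ is sent to the system, the observation $\boldsymbol{o}_t$ and reward $r_t$ are returned]
Micro-data reinforcement learning!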

Research program
• Generative time-series predictors (= neural system models)
› Sample efficient: can be learned on a couple of thousand time steps
› Introspective and well-calibrated: honest about their own uncertainty
• Control and exploration using system models
› Basic model predictive control (random shooting)
› Active sampling and exploration
› Learn the control agent
› Multi-agent control and transfer learning
• Landing
› Wireless parameter tuning
› Data center cooling
› Diagnostics and debugging tools usable by engineers









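Objective of neural system models
• Predict the (random) future from the history of system observables and control actions:
$$\boldsymbol{o}_{t+1} \sim p\big(\underbrace{\boldsymbol{o}_{t+1}}_{\boldsymbol{y}} \,\big|\, \underbrace{(\boldsymbol{o}_1, a_1), \ldots, (\boldsymbol{o}_t, a_t)}_{\boldsymbol{x}}\big)$$
› We want to simulate multiple futures from the model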

Objective
• Generative regression: predict $\boldsymbol{y} \sim p(\boldsymbol{y} \mid \boldsymbol{x})$ instead of $\boldsymbol{y} = f(\boldsymbol{x})$
› Predictors that are honest about their uncertainty: introspective models
• Requirements
› Both $\boldsymbol{x}$ and $\boldsymbol{y}$ are multidimensional
› Training should scale well with the dimension of $\boldsymbol{x}$ and $\boldsymbol{y}$ and the size of the training data
› Easy to compute likelihood
› Easy to sample (simulate)
› Able to model y-interdependence
› Able to model different types of variables
› Frequent semi-automatic retraining and retuning: robustness and debuggability

Can AI learn physics (of a system) from data?
[Figure: system with joint angles θ1 and θ2]

Yes it can!
Which one is the physical model and which one is AI?
You can vote in the chat window: AI is left or right?

Formal model illustrated on acrobot
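System observables: $\boldsymbol{o} = (\theta^2, \dot{\theta}^2, \theta^1, \dot{\theta}^1)$ (superscripts index the joint)
Actions: torque at second joint, $a \in \{\text{left}, \text{none}, \text{right}\}$
Objective: learn $p\big(\boldsymbol{o}_{t+1} \mid (\boldsymbol{o}_1, a_1), \ldots, (\boldsymbol{o}_t, a_t)\big)$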
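Decomposition 1 (summarizing history):
$$p\big(\boldsymbol{o}_{t+1} \mid (\boldsymbol{o}_1, a_1), \ldots, (\boldsymbol{o}_t, a_t)\big) = p\big(\boldsymbol{o}_{t+1} \mid \boldsymbol{f}_{\mathrm{FE}}((\boldsymbol{o}_1, a_1), \ldots, (\boldsymbol{o}_t, a_t))\big)$$
$\boldsymbol{f}_{\mathrm{FE}}$ is a time series feature extractor:
$$\boldsymbol{s}_t = \boldsymbol{f}_{\mathrm{FE}}\big((\boldsymbol{o}_1, a_1), \ldots, (\boldsymbol{o}_t, a_t)\big)$$
$$p\big(\boldsymbol{o}_{t+1} \mid (\boldsymbol{o}_1, a_1), \ldots, (\boldsymbol{o}_t, a_t)\big) = p(\boldsymbol{o}_{t+1} \mid \boldsymbol{s}_t)$$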
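As a minimal sketch of Decomposition 1, the feature extractor below assumes the system is Markov, so that the summary $\boldsymbol{s}_t$ only needs the last observation and a one-hot encoding of the last action. The function name `extract_features` and the integer action coding (0 = left, 1 = none, 2 = right) are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def extract_features(observations, actions, n_actions=3):
    """Hypothetical f_FE: summarize the history (o_1, a_1), ..., (o_t, a_t) as s_t.

    Assumes a Markov system, so only the last observation and action are used.
    """
    last_obs = np.asarray(observations[-1], dtype=float)
    one_hot = np.zeros(n_actions)
    one_hot[actions[-1]] = 1.0  # assumed coding: 0 = left, 1 = none, 2 = right
    return np.concatenate([last_obs, one_hot])

# Toy history of two time steps with four observables each.
history_obs = [[0.1, 0.0, -0.2, 0.0], [0.2, 0.5, -0.1, 0.3]]
history_act = [1, 2]
s_t = extract_features(history_obs, history_act)
```

A richer extractor could append lagged observations or engineered features (for instance sines and cosines of angles) without changing the interface.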
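Decomposition 2 (autoregression):
$$p(\boldsymbol{o}_{t+1} \mid \boldsymbol{s}_t) = p_1\big(\theta^2_{t+1} \mid \boldsymbol{s}_t\big)\, p_2\big(\dot{\theta}^2_{t+1} \mid \boldsymbol{s}_t, \theta^2_{t+1}\big)\, p_3\big(\theta^1_{t+1} \mid \boldsymbol{s}_t, \theta^2_{t+1}, \dot{\theta}^2_{t+1}\big)\, p_4\big(\dot{\theta}^1_{t+1} \mid \boldsymbol{s}_t, \theta^2_{t+1}, \dot{\theta}^2_{t+1}, \theta^1_{t+1}\big)$$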
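Decomposition 3 (mixture model):
$$p(y \mid \boldsymbol{x}) = \sum_{\ell=1}^{L} w_\ell(\boldsymbol{x})\, \mathcal{P}_\ell\big(y; \theta_\ell(\boldsymbol{x})\big)$$
$\mathcal{P}$: component type (e.g. Gaussian)
$w$: component weight
$\theta$: component parameters (e.g. $\mu$, $\sigma$)
[Figure: acrobot with joint angles θ1 and θ2]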
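To make the mixture decomposition concrete, here is a minimal NumPy sketch of a one-dimensional Gaussian mixture for a fixed input $\boldsymbol{x}$: it evaluates the log-likelihood and draws samples. In DARMDN the weights and component parameters would be the outputs of a neural net; here they are plain arrays, and the function names are illustrative assumptions.

```python
import numpy as np

def mixture_log_likelihood(y, weights, means, sigmas):
    """Log-density of a scalar y under a 1-D Gaussian mixture."""
    weights, means, sigmas = (np.asarray(v, dtype=float) for v in (weights, means, sigmas))
    log_components = (
        np.log(weights)
        - 0.5 * np.log(2.0 * np.pi)
        - np.log(sigmas)
        - 0.5 * ((y - means) / sigmas) ** 2
    )
    # Log-sum-exp over the components for numerical stability.
    m = np.max(log_components)
    return float(m + np.log(np.sum(np.exp(log_components - m))))

def mixture_sample(rng, weights, means, sigmas, size=1):
    """Pick a component by its weight, then sample from that Gaussian."""
    weights, means, sigmas = (np.asarray(v, dtype=float) for v in (weights, means, sigmas))
    idx = rng.choice(len(weights), size=size, p=weights)
    return rng.normal(means[idx], sigmas[idx])

rng = np.random.default_rng(0)
w, mu, sigma = [0.3, 0.7], [-2.0, 1.0], [0.5, 0.2]
samples = mixture_sample(rng, w, mu, sigma, size=1000)
log_p = mixture_log_likelihood(0.9, w, mu, sigma)
```

Both operations cost a handful of vectorized operations per point, which is what makes likelihood evaluation and simulation cheap for this model family.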

Why the decompositions?
• 1. Explicit summary of history $\boldsymbol{s}_t = \boldsymbol{f}_{\mathrm{FE}}\big((\boldsymbol{o}_1, a_1), \ldots, (\boldsymbol{o}_t, a_t)\big)$
› Simplifies the time series problem into "classical" prediction
› System engineers can input prior knowledge
› Can be fine-tuned using end-to-end training or extended to RNNs
• 2. Autoregression $p(\boldsymbol{y} \mid \boldsymbol{x}) = p_1(y_1 \mid \boldsymbol{x}) \prod_{j=2}^{d} p_j(y_j \mid y_1, \ldots, y_{j-1}, \boldsymbol{x})$
› Fighting the curse of dimensionality:
» We reduce the $d$-dimensional model to $d$ one-dimensional models
› We can tune the models separately:
» unlike e.g. images, system logs may have varying column types
› Modelling y-interdependence: $y_1$ and $y_2$ can be strongly dependent given $\boldsymbol{x}$ in physical systems
• 3. Mixture model $p(y \mid \boldsymbol{x}) = \sum_{\ell=1}^{L} w_\ell(\boldsymbol{x})\, \mathcal{P}_\ell\big(y; \theta_\ell(\boldsymbol{x})\big)$
› Simple: easy to compute likelihood, easy to simulate from
› Versatile: can use prior knowledge (component type), can approximate any density
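The autoregressive factorization can be sketched in a few lines: each output dimension is sampled from a one-dimensional conditional that sees the input and the dimensions already drawn, and the joint log-likelihood is the sum of the one-dimensional terms. The toy heads below are hypothetical stand-ins for the trained networks and use a single Gaussian per dimension to keep the example short.

```python
import numpy as np

def make_toy_head(j):
    """Hypothetical stand-in for the j-th 1-D model p_j(y_j | y_1..y_{j-1}, x)."""
    def head(inputs):
        mu = np.tanh(np.sum(inputs))  # mean depends on x and on the earlier y's
        sigma = 0.1 * (j + 1)         # fixed spread per dimension, for simplicity
        return mu, sigma
    return head

def sample_autoregressive(rng, heads, x):
    """Draw y = (y_1, ..., y_d) one dimension at a time."""
    y = []
    for head in heads:
        mu, sigma = head(np.concatenate([x, np.array(y, dtype=float)]))
        y.append(rng.normal(mu, sigma))
    return np.array(y)

def log_likelihood_autoregressive(heads, x, y):
    """Chain rule: log p(y | x) = sum_j log p_j(y_j | y_1..y_{j-1}, x)."""
    total = 0.0
    for j, head in enumerate(heads):
        mu, sigma = head(np.concatenate([x, y[:j]]))
        total += -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * ((y[j] - mu) / sigma) ** 2
    return float(total)

rng = np.random.default_rng(0)
heads = [make_toy_head(j) for j in range(4)]
x = np.array([0.3, -0.1])
y = sample_autoregressive(rng, heads, x)
log_p = log_likelihood_autoregressive(heads, x, y)
```

Because each head is a separate one-dimensional model, a head can be swapped for a different component type (say, a categorical column) without touching the others.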

How do we learn the model?
• Any regressor + fixed sigma: $p(y \mid \boldsymbol{x}) = \mathcal{N}\big(f(\boldsymbol{x}; \boldsymbol{\theta}), \sigma\big)$
› Linear regression
› Classical neural nets
• We learn the parameters ($w(\boldsymbol{x})$ and $\theta(\boldsymbol{x})$) with a deep neural net: deep autoregressive mixture density nets = DARMDN ("darm-dee-en")
› DARMDN(1) with a single Gaussian component: heteroscedastic $p(y \mid \boldsymbol{x}) = \mathcal{N}\big(\mu(\boldsymbol{x}), \sigma(\boldsymbol{x})\big)$
› DARMDN(10)
• Non-autoregressive models
› Gaussian process
› DMDN(10): classical mixture density nets with multivariate Gaussian components [Bishop 1994]
› Both assume y-independence
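The difference between a fixed-sigma regressor and a heteroscedastic DARMDN(1)-style model shows up directly in the negative log-likelihood, which is the natural training loss for both. The sketch below does not train a network: as a simplifying assumption it plugs in the true mean and noise functions of a synthetic problem and only compares the two losses.

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Mean negative log-likelihood of y under N(mu, sigma); sigma may be scalar or array."""
    return float(np.mean(0.5 * np.log(2.0 * np.pi * sigma**2) + 0.5 * ((y - mu) / sigma) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=10_000)
true_mu = 2.0 * x
true_sigma = 0.1 + np.abs(x)  # synthetic noise level that grows with |x|
y = rng.normal(true_mu, true_sigma)

# Fixed-sigma model: one global sigma, set to the residual standard deviation.
const_sigma = np.sqrt(np.mean((y - true_mu) ** 2))
nll_const = gaussian_nll(y, true_mu, const_sigma)

# Heteroscedastic model: sigma is a function of x.
nll_hetero = gaussian_nll(y, true_mu, true_sigma)
```

On this data the heteroscedastic loss is lower even though both models share the same mean, which is the sense in which a model can be more "honest about its uncertainty" without being more precise.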

What is y-interdependence and why is it important?
[Figure: sin θ versus cos θ; panels: GP, DMDN(5), DARMDN(1)]

Is a multimodal posterior predictive important?
• Approximation capacity in system modelling
› We want to be able to represent the real system dynamics efficiently
› We also want a realistic representation of uncertainty ("plausible futures") to support exploration
• "Raw angles" acrobot
› Normally angles are transformed using sine and cosine to make the system dynamics smooth
› What if we are agnostic? We do not know whether a system variable is an angle
› Abrupt jumps are OK, but if we have (epistemic) uncertainty, posteriors need to be multimodal
[Figure: "raw angles" acrobot]

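Evaluation
Likelihood ratio to simple baseline
$$\mathrm{LR}\big(\{(\boldsymbol{o}_t, a_t)\}_{t=1}^{T}; p\big) = \frac{e^{\mathcal{L}\left(\{(\boldsymbol{o}_t, a_t)\}_{t=1}^{T};\, p\right)}}{e^{\mathcal{L}_b\left(\{(\boldsymbol{o}_t, a_t)\}_{t=1}^{T}\right)}}$$
› The baseline, with log-likelihood $\mathcal{L}_b$, is a multivariate unconditional spherical Gaussian
› Measures how much more likely the data is under the learned model than under the baseline
› Baseline = 1, higher is better, no upper limit
Log-likelihood
$$\mathcal{L}\big(\{(\boldsymbol{o}_t, a_t)\}_{t=1}^{T}; p\big) = \frac{1}{T-1} \sum_{t=1}^{T-1} \Big[ \log p_1\big(o^1_{t+1} \mid \boldsymbol{s}_t\big) + \sum_{j=2}^{4} \log p_j\big(o^j_{t+1} \mid \boldsymbol{s}_t, o^1_{t+1}, \ldots, o^{j-1}_{t+1}\big) \Big]$$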
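The likelihood ratio can be computed from two mean log-likelihoods. The sketch below assumes the spherical baseline has a per-dimension mean and a single variance shared by all dimensions, fitted on the same targets it is scored on; both choices are assumptions, since the slides do not spell out how the baseline is fitted.

```python
import numpy as np

def spherical_gaussian_mean_loglik(targets):
    """Mean log-likelihood of the rows of `targets` under a fitted spherical Gaussian."""
    targets = np.asarray(targets, dtype=float)
    n, d = targets.shape
    mean = targets.mean(axis=0)
    var = np.mean((targets - mean) ** 2)  # one variance shared by all dimensions
    sq_dist = np.sum((targets - mean) ** 2, axis=1)
    log_p = -0.5 * d * np.log(2.0 * np.pi * var) - 0.5 * sq_dist / var
    return float(np.mean(log_p))

def likelihood_ratio(model_mean_loglik, baseline_mean_loglik):
    """LR = exp(L) / exp(L_b), computed in log space."""
    return float(np.exp(model_mean_loglik - baseline_mean_loglik))

rng = np.random.default_rng(0)
targets = rng.normal(size=(500, 4))  # stand-in for the observables o_{t+1}
baseline = spherical_gaussian_mean_loglik(targets)
lr_of_baseline = likelihood_ratio(baseline, baseline)  # the baseline scores 1 against itself
```

A learned model's mean log-likelihood, computed with the autoregressive formula above, would be passed as the first argument of `likelihood_ratio`.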

Results on skewed acrobot data
Acrobot "sincos", data generated with linear policy, time series, 5K training points

Algorithm                            Likelihood ratio to    Precision (R2)    Calibratedness (Kolmogorov-Smirnov)
                                     spherical Gaussian     after 10 steps    after 10 steps
Linear regression + constant sigma   2                      4%                0.127
Gaussian process                     56                     83%               0.133
NN regression + constant sigma       32                     55%               0.194
DMDN with 10 components              95                     90%               0.128
DARMDN with 10 components            119                    87%               0.095

• DARMDN is both precise and well-calibrated
• OK, but does it matter for model-based RL?
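One common way to obtain a Kolmogorov-Smirnov calibration score, sketched here as an assumption about what the table's last column measures, is to pass each observed value through the model's predictive CDF and compare the resulting values with the uniform distribution. Lower is better. The example uses Gaussian predictive distributions and synthetic data.

```python
import math
import numpy as np

def gaussian_cdf(y, mu, sigma):
    """Predictive CDF of N(mu, sigma) evaluated at each y."""
    z = (np.asarray(y, dtype=float) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + np.vectorize(math.erf)(z))

def ks_to_uniform(u):
    """Kolmogorov-Smirnov distance between the empirical CDF of u and Uniform(0, 1)."""
    u = np.sort(np.asarray(u, dtype=float))
    n = len(u)
    upper = np.arange(1, n + 1) / n - u
    lower = u - np.arange(0, n) / n
    return float(max(upper.max(), lower.max()))

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=5000)  # synthetic ground truth

ks_calibrated = ks_to_uniform(gaussian_cdf(y, 0.0, 1.0))     # correct predictive spread
ks_overconfident = ks_to_uniform(gaussian_cdf(y, 0.0, 0.5))  # predictive spread too narrow
```

The overconfident model has exactly the same mean prediction, so an accuracy metric such as R2 would not distinguish the two; the calibration score does.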

Model-based RL loop
1. Collect samples from a random policy
2. Train the model on the collected samples
3. Learn a control policy on the model
4. Apply the control policy on the real system, collect the data, and go back to 2.
• We retrain the model after each episode of 200 steps
• Control policy is classical random shooting (RS) [Richards 2005]
› Simulate trajectories of $N = 10$ steps using random actions
› Select the optimal trajectory (with the highest reward after $N$ steps)
› Execute the first action of the optimal trajectory
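Random shooting as described above can be sketched as follows. Two assumptions are made for illustration: "highest reward" is read as the highest cumulative reward over the $N$ steps, and `toy_model_step` is a hypothetical stand-in for sampling one step from a learned model such as DARMDN.

```python
import numpy as np

def random_shooting(rng, model_step, obs, actions, n_trajectories=100, horizon=10):
    """Return the first action of the best of `n_trajectories` random rollouts."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_trajectories):
        plan = rng.choice(actions, size=horizon)  # random action sequence
        o, total = obs, 0.0
        for a in plan:
            o, r = model_step(rng, o, a)          # simulate one step on the model
            total += r
        if total > best_return:
            best_return, best_first_action = total, plan[0]
    return best_first_action

def toy_model_step(rng, obs, action):
    """Hypothetical model: noisy 1-D integrator, rewarded for staying near the origin."""
    next_obs = obs + 0.1 * action + rng.normal(0.0, 0.01)
    return next_obs, -abs(next_obs)

rng = np.random.default_rng(0)
actions = np.array([-1, 0, 1])  # left, none, right
first_action = random_shooting(rng, toy_model_step, obs=1.0, actions=actions)
```

In the loop above, this function would be called once per real time step, with the model retrained between episodes.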

Acrobot "raw angles"
• DARMDN with random shooting is the new SOTA
› Almost as good as planning using the real system dynamics
› Converges 2 to 4 times faster than previous SOTA
[Figure annotations: ×4, ×2]

Acrobot "sincos"

Learnt policy after ~10k samples

Deterministic predictors
• Do we really need to represent uncertainty?
• $p(y \mid \boldsymbol{x}) = \mathrm{Dirac}\big(f(\boldsymbol{x}; \boldsymbol{\theta})\big)$
• What models?
› NNdet: classical neural net
› DARMDN(10)det: mean of the posterior predictive, i.e. a deterministic model learned probabilistically
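For a mixture model, the deterministic prediction described above is simply the weighted average of the component means. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def mixture_mean(weights, means):
    """Mean of a mixture: sum over components of weight times component mean."""
    return float(np.dot(weights, means))

# Bimodal example: two equally likely modes at -3 and +3.
point_prediction = mixture_mean([0.5, 0.5], [-3.0, 3.0])  # 0.0, between the two modes
```

The bimodal case shows the price of collapsing to the mean: the point prediction can land in a region the mixture itself considers unlikely.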

Acrobot "raw angles": no surprise, deterministic models are suboptimal

Acrobot "sincos": what?
Deterministic model is optimal but only if learned probabilistically

Is it heteroscedasticity or multimodality?

It is heteroscedasticity

Broader applications of DARMDN
• Model-based control, bandits, and reinforcement learning
› Learn to control the system in a sample-efficient way:
» "real world will not become faster in a few years, contrary to computers" [Chatzilygeroudis et al., 2019]
› State of the art suffers from the lack of efficient system modelling tools
› Modelling uncertainties is crucial for safety
• Bayesian optimization
› Requires good and efficient models to quantify uncertainty due to unknown
• Transfer learning, meta-learning, and robust reinforcement learning
› Precise probabilistic system models make it possible to transfer models between systems of the same kind
• Anomaly detection
› Anomaly = system state is beyond "likely" behavior
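The anomaly-detection idea can be sketched with a likelihood threshold: a step is flagged when its log-likelihood under the model falls below a low quantile of the log-likelihoods seen on normal data. The 1% quantile and the function names are illustrative assumptions, and the log-likelihood values below are synthetic.

```python
import numpy as np

def fit_threshold(train_logliks, quantile=0.01):
    """Log-likelihood below which a step counts as unlikely (assumed 1% quantile)."""
    return float(np.quantile(train_logliks, quantile))

def flag_anomalies(logliks, threshold):
    """Boolean mask: True where the model finds the observed step unlikely."""
    return np.asarray(logliks, dtype=float) < threshold

train_logliks = np.linspace(-5.0, 0.0, 101)  # synthetic per-step log-likelihoods
threshold = fit_threshold(train_logliks)
flags = flag_anomalies([-10.0, -1.0], threshold)
```

This only works as well as the model is calibrated, which is one reason the calibration metric matters beyond control.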

Conclusions
• Deep autoregressive mixture density nets (DARMDN) + random shooting is the new SOTA on Acrobot
• Autoregression is useful for modelling y-interdependence
• A multimodal posterior predictive is necessary on the "raw angles" representation
• Deterministic DARMDN is as good as stochastic models on the "sincos" representation and beats the NN model trained with a deterministic (RMSE) loss
› Something happens in the long horizon, no error accumulation
› Perhaps heteroscedastic epistemic uncertainty models may "let outliers go"?

Thank you
Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved.
The information in this document may contain predictive statements including, without limitation, statements regarding the future
financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual
results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such
information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the
information at any time without notice.