- 1. HUAWEI TECHNOLOGIES CO., LTD. www.huawei.com DARMDN: Deep autoregressive mixture density nets for dynamical system modelling — Balazs Kegl, Gabriel Hurtado, Albert Thomas for Noah's Ark Research Lab, Paris
- 2. Objective: develop neural simulators trained on short system logs. B. Kegl / Huawei Research France
- 3. Why?
  - Automate engineering systems
    - Data center cooling
    - Wireless parameter tuning
    - Wifi setup
  - Predictive maintenance
    - Copper and optical end-user devices
    - Wireless network devices
    - Data center servers
  - We believe these are only the tip of the iceberg
- 4. AI: Highly visible breakthroughs
- 5. Why aren't these algorithms already in engineering systems?
- 6. Why is it hard?
  - Physical systems do not get faster with time
  - System access is tightly controlled by engineers whose responsibility is to keep the systems running
  - [Diagram labels: BU, Engineer, System; action $a_t$; observation and reward $o_t, r_t$; "Micro-data!!!"; reinforcement learning]
- 7. Research program
  - Generative time-series predictors (= neural system models)
    - Sample efficient: can be learned on a few thousand time steps
    - Introspective and well-calibrated: honest about their own uncertainty
  - Control and exploration using system models
    - Basic model predictive control (random shooting)
    - Active sampling and exploration
    - Learn the control agent
    - Multi-agent control and transfer learning
  - Landing
    - Wireless parameter tuning
    - Data center cooling
    - Diagnostics and debugging tools usable by engineers
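- 8. Objective of neural system models
  - Predict the (random) future from the history of system observables and control actions: $\boldsymbol{o}_{t+1} \sim p(\boldsymbol{o}_{t+1} \mid (\boldsymbol{o}_1, a_1), \ldots, (\boldsymbol{o}_t, a_t))$, where the target $\boldsymbol{y}$ is $\boldsymbol{o}_{t+1}$ and the input $\boldsymbol{x}$ is the history $(\boldsymbol{o}_1, a_1), \ldots, (\boldsymbol{o}_t, a_t)$
  - We want to simulate multiple futures from the model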
- 9. Objective
  - Generative regression: predict $\boldsymbol{y} \sim p(\boldsymbol{y} \mid \boldsymbol{x})$ instead of $\boldsymbol{y} = f(\boldsymbol{x})$
    - Predictors that are honest about their uncertainty: introspective models
  - Requirements
    - Both $\boldsymbol{x}$ and $\boldsymbol{y}$ are multidimensional
    - Training should scale well with the dimension of $\boldsymbol{x}$ and $\boldsymbol{y}$ and with the size of the training data
    - Easy to compute likelihood
    - Easy to sample (simulate)
    - Able to model y-interdependence
    - Able to model different types of variables
    - Frequent semi-automatic retraining and retuning: robustness and debuggability
- 10. Can AI learn physics (of a system) from data? [Figure: system with angles $\theta_1$ and $\theta_2$]
- 11. Yes it can! Which one is the physical model and which one is AI? You can vote in the chat window: AI is left or right?
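- 12. Formal model illustrated on acrobot
  - System observables: $\boldsymbol{o} = (\theta_2, \dot{\theta}_2, \theta_1, \dot{\theta}_1)$
  - Actions: torque at second joint, $a \in \{\text{left}, \text{none}, \text{right}\}$
  - Objective: learn $p(\boldsymbol{o}_{t+1} \mid (\boldsymbol{o}_1, a_1), \ldots, (\boldsymbol{o}_t, a_t))$
  - Decomposition 1 (summarizing history): $p(\boldsymbol{o}_{t+1} \mid (\boldsymbol{o}_1, a_1), \ldots, (\boldsymbol{o}_t, a_t)) = p(\boldsymbol{o}_{t+1} \mid \boldsymbol{f}_{\mathrm{FE}}((\boldsymbol{o}_1, a_1), \ldots, (\boldsymbol{o}_t, a_t)))$
    - $\boldsymbol{f}_{\mathrm{FE}}$ is a time series feature extractor: $\boldsymbol{s}_t = \boldsymbol{f}_{\mathrm{FE}}((\boldsymbol{o}_1, a_1), \ldots, (\boldsymbol{o}_t, a_t))$
    - $p(\boldsymbol{o}_{t+1} \mid (\boldsymbol{o}_1, a_1), \ldots, (\boldsymbol{o}_t, a_t)) = p(\boldsymbol{o}_{t+1} \mid \boldsymbol{s}_t)$
  - Decomposition 2 (autoregression; superscripts index the joint): $p(\boldsymbol{o}_{t+1} \mid \boldsymbol{s}_t) = p_1(\theta^2_{t+1} \mid \boldsymbol{s}_t)\, p_2(\dot{\theta}^2_{t+1} \mid \boldsymbol{s}_t, \theta^2_{t+1})\, p_3(\theta^1_{t+1} \mid \boldsymbol{s}_t, \theta^2_{t+1}, \dot{\theta}^2_{t+1})\, p_4(\dot{\theta}^1_{t+1} \mid \boldsymbol{s}_t, \theta^2_{t+1}, \dot{\theta}^2_{t+1}, \theta^1_{t+1})$
  - Decomposition 3 (mixture model): $p(y \mid \boldsymbol{x}) = \sum_{\ell=1}^{L} w_\ell(\boldsymbol{x})\, \mathcal{P}_\ell(y; \theta_\ell(\boldsymbol{x}))$
    - $\mathcal{P}$: component type (e.g. Gaussian)
    - $w$: component weight
    - $\theta$: component parameters (e.g. $\mu$, $\sigma$)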
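The autoregressive and mixture decompositions on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the components are Gaussian, and `toy_params` is a hypothetical hand-written stand-in for the neural net that would map the conditioning inputs to mixture parameters. None of the function names come from the talk.

```python
import math
import random


def gaussian_pdf(y, mu, sigma):
    """Density of a univariate Gaussian with mean mu and std sigma at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(y, weights, mus, sigmas):
    """p(y | x) = sum_l w_l * N(y; mu_l, sigma_l), parameters already evaluated at x."""
    return sum(w * gaussian_pdf(y, m, s) for w, m, s in zip(weights, mus, sigmas))


def mixture_sample(weights, mus, sigmas, rng):
    """Pick a component with probability w_l, then sample from that Gaussian."""
    idx = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.gauss(mus[idx], sigmas[idx])


def toy_params(inputs):
    """Hypothetical stand-in for the network: inputs -> 2-component mixture."""
    center = sum(inputs) / len(inputs)
    return [0.5, 0.5], [center - 1.0, center + 1.0], [0.1, 0.1]


def autoregressive_sample(s_t, d, rng):
    """Sample y_1..y_d in order; y_j is conditioned on s_t and y_1..y_{j-1}."""
    y = []
    for _ in range(d):
        weights, mus, sigmas = toy_params(list(s_t) + y)
        y.append(mixture_sample(weights, mus, sigmas, rng))
    return y


def autoregressive_log_likelihood(s_t, y):
    """log p(y | s_t) as the sum of the d one-dimensional conditional log-densities."""
    total = 0.0
    for j in range(len(y)):
        weights, mus, sigmas = toy_params(list(s_t) + list(y[:j]))
        total += math.log(mixture_pdf(y[j], weights, mus, sigmas))
    return total
```

Calling `autoregressive_sample([0.0, 0.0], 4, random.Random(0))` draws one four-dimensional "future"; repeating the call gives further plausible futures from the same history summary.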
- 13. Why the decompositions?
  1. Explicit summary of history: $\boldsymbol{s}_t = \boldsymbol{f}_{\mathrm{FE}}((\boldsymbol{o}_1, a_1), \ldots, (\boldsymbol{o}_t, a_t))$
     - Simplifies the time series problem into "classical" prediction
     - System engineers can input prior knowledge
     - Can be fine-tuned using end-to-end training or extended to RNNs
  2. Autoregression: $p(\boldsymbol{y} \mid \boldsymbol{x}) = p_1(y_1 \mid \boldsymbol{x}) \prod_{j=2}^{d} p_j(y_j \mid y_1, \ldots, y_{j-1}, \boldsymbol{x})$
     - Fighting the curse of dimensionality: we reduce the $d$-dimensional model to $d$ one-dimensional models
     - We can tune the models separately: unlike e.g. images, system logs may have varying column types
     - Modelling y-interdependence: $y_1$ and $y_2$ can be strongly dependent given $\boldsymbol{x}$ in physical systems
  3. Mixture model: $p(y \mid \boldsymbol{x}) = \sum_{\ell=1}^{L} w_\ell(\boldsymbol{x})\, \mathcal{P}_\ell(y; \theta_\ell(\boldsymbol{x}))$
     - Simple: easy to compute likelihood, easy to simulate from
     - Versatile: can use prior knowledge (component type), can approximate any density
- 14. How do we learn the model?
  - Any regressor + fixed sigma: $p(y \mid \boldsymbol{x}) = \mathcal{N}(f(\boldsymbol{x}; \boldsymbol{\theta}), \sigma)$
    - Linear regression
    - Classical neural nets
  - We learn the parameters ($w(\boldsymbol{x})$ and $\theta(\boldsymbol{x})$) with a deep neural net: deep autoregressive mixture density nets = DARMDN ("darm-dee-en")
    - DARMDN(1) with a single Gaussian component: heteroscedastic $p(y \mid \boldsymbol{x}) = \mathcal{N}(\mu(\boldsymbol{x}), \sigma(\boldsymbol{x}))$
    - DARMDN(10)
  - Non-autoregressive models
    - Gaussian process
    - DMDN(10): classical mixture density nets with multivariate Gaussian components [Bishop 1994]
    - Both assume y-independence
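The difference between a fixed-sigma regressor and a heteroscedastic one can be illustrated with the negative log-likelihood (NLL), which is the usual training criterion for mixture density nets; the slide does not state the loss, so treating NLL as the objective is an assumption here. The toy data and both predictors below are made up for illustration.

```python
import math


def log_gaussian(y, mu, sigma):
    """Log-density of a univariate Gaussian with mean mu and std sigma."""
    return -0.5 * ((y - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def mixture_nll(y, weights, mus, sigmas):
    """Negative log-likelihood of y under a Gaussian mixture (log-sum-exp for stability)."""
    terms = [math.log(w) + log_gaussian(y, m, s) for w, m, s in zip(weights, mus, sigmas)]
    top = max(terms)
    return -(top + math.log(sum(math.exp(t - top) for t in terms)))


def mean_nll(data, predict):
    """Average NLL over (x, y) pairs; predict(x) returns (weights, mus, sigmas)."""
    return sum(mixture_nll(y, *predict(x)) for x, y in data) / len(data)


# Toy data (made up): the noise level depends on x, small at x=0 and large at x=1.
data = [(0, -0.1), (0, 0.1), (1, -1.0), (1, 1.0)]


def fixed_sigma(x):
    """Single Gaussian with one sigma shared by all inputs."""
    return [1.0], [0.0], [0.7]


def heteroscedastic(x):
    """Single Gaussian whose sigma depends on the input, as in DARMDN(1)."""
    return [1.0], [0.0], [0.1 if x == 0 else 1.0]


nll_fixed = mean_nll(data, fixed_sigma)
nll_hetero = mean_nll(data, heteroscedastic)
```

On this toy data the input-dependent sigma yields the lower average NLL, since no single sigma can fit both noise levels at once.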
- 15. What is y-interdependence and why is it important? [Figure: panels GP, DMDN(5), DARMDN(1); axes $\sin\theta$ and $\cos\theta$]
- 16. What is y-interdependence and why is it important?
- 17. Is multi-modal posterior predictive important?
  - Approximation capacity in system modelling
    - We want to be able to represent the real system dynamics efficiently
    - We also want a realistic representation of uncertainty ("plausible futures") to support exploration
  - "Raw angles" acrobot
    - Normally angles are transformed using sine and cosine to make the system dynamics smooth
    - What if we are agnostic? We do not know if a system variable is an angle
    - Abrupt jumps are OK, but if we have (epistemic) uncertainty, posteriors need to be multimodal
- 18. Is multi-modal posterior predictive important? "Raw angles" acrobot
- 19. Evaluation
  - Log likelihood: $\mathcal{L}(\{(\boldsymbol{o}_t, a_t)\}_{t=1}^{T}; p) = \frac{1}{T-1} \sum_{t=1}^{T-1} \left( \log p_1(o^1_{t+1} \mid \boldsymbol{s}_t) + \sum_{j=2}^{4} \log p_j(o^j_{t+1} \mid \boldsymbol{s}_t, o^1_{t+1}, \ldots, o^{j-1}_{t+1}) \right)$
  - Likelihood ratio to simple baseline: $LR(\{(\boldsymbol{o}_t, a_t)\}_{t=1}^{T}; p) = \dfrac{e^{\mathcal{L}(\{(\boldsymbol{o}_t, a_t)\}_{t=1}^{T}; p)}}{e^{\mathcal{L}_b(\{(\boldsymbol{o}_t, a_t)\}_{t=1}^{T})}}$
    - Baseline density $\mathcal{L}_b$ is a multivariate unconditional spherical Gaussian
    - Measures how much more likely the data is under the learned model than under the baseline
    - Baseline = 1; higher is better; no upper limit
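The likelihood ratio on this slide can be computed as in the sketch below. The slide does not say how the spherical Gaussian baseline is fitted, so estimating its mean vector and single shared variance from the evaluated targets is an assumption, and the function names are illustrative.

```python
import math


def spherical_gaussian_mean_loglik(targets):
    """Mean log-likelihood per row under a spherical Gaussian fitted to the rows.

    Assumption: the baseline's mean vector and its single shared variance
    are estimated from the same rows it is evaluated on.
    """
    n, d = len(targets), len(targets[0])
    mean = [sum(row[j] for row in targets) / n for j in range(d)]
    var = sum((row[j] - mean[j]) ** 2 for row in targets for j in range(d)) / (n * d)
    total = 0.0
    for row in targets:
        sq_dist = sum((row[j] - mean[j]) ** 2 for j in range(d))
        total += -0.5 * sq_dist / var - 0.5 * d * math.log(2.0 * math.pi * var)
    return total / n


def likelihood_ratio(model_mean_loglik, baseline_mean_loglik):
    """LR = exp(L) / exp(L_b), computed as exp(L - L_b) to avoid overflow."""
    return math.exp(model_mean_loglik - baseline_mean_loglik)
```

A model that is no better than the baseline gets a ratio of 1, matching "Baseline = 1" on the slide; `model_mean_loglik` would be the per-step average of the autoregressive log-likelihood defined above.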
- 20. Results on skewed acrobot data

  Acrobot "sincos", data generated with linear policy time series, 5K training points

  | Algorithm | Likelihood ratio to spherical Gaussian | Precision ($R^2$) after 10 steps | Calibratedness (Kolmogorov-Smirnov) after 10 steps |
  |---|---|---|---|
  | Linear regression + constant sigma | 2 | 4% | 0.127 |
  | Gaussian process | 56 | 83% | 0.133 |
  | NN regression + constant sigma | 32 | 55% | 0.194 |
  | DMDN with 10 components | 95 | 90% | 0.128 |
  | DARMDN with 10 components | 119 | 87% | 0.095 |

  DARMDN is both precise and well-calibrated. OK, but does it matter for model-based RL?
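One common way to obtain a Kolmogorov-Smirnov calibratedness score is to evaluate each predictive CDF at the observed value (the probability integral transform) and measure how far those values are from uniform. Whether the table above was computed exactly this way is not stated in the slides, so the sketch below is an assumption; it uses Gaussian predictive distributions and synthetic observations for simplicity.

```python
from statistics import NormalDist


def ks_to_uniform(pit_values):
    """KS distance between the empirical CDF of values in [0, 1] and the uniform CDF."""
    u = sorted(pit_values)
    n = len(u)
    return max(max((i + 1) / n - v, v - i / n) for i, v in enumerate(u))


def calibration_ks(observations, predictive):
    """KS score of one fixed predictive distribution against the observations."""
    return ks_to_uniform([predictive.cdf(y) for y in observations])


# Synthetic observations: evenly spaced quantiles of a standard normal.
n = 100
truth = NormalDist(0.0, 1.0)
observations = [truth.inv_cdf((i + 0.5) / n) for i in range(n)]

ks_calibrated = calibration_ks(observations, NormalDist(0.0, 1.0))
ks_overconfident = calibration_ks(observations, NormalDist(0.0, 0.5))
```

Under this definition a smaller score means better calibration: the overconfident predictor, whose sigma is too small, gets a larger KS distance than the correctly specified one.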
- 21. Model-based RL loop
  1. Collect samples from a random policy
  2. Train model on collected samples
  3. Learn control policy on the model
  4. Apply control policy on real system and collect the data, go back to 2.
  - We retrain the model after each episode of 200 steps
  - Control policy is classical random shooting (RS) [Richards 2005]
    - Simulate trajectories of $N = 10$ steps using random actions
    - Select the optimal trajectory (with the highest reward after $N$ steps)
    - Execute the first action of the optimal trajectory
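The random shooting controller described on this slide can be sketched as follows. The toy one-dimensional model and reward are hypothetical stand-ins for a learned system model, and scoring a trajectory by the sum of its rewards over the horizon is an assumption, since the slide only says "highest reward after N steps".

```python
import random


def random_shooting(state, model_step, reward_fn, actions, horizon, n_trajectories, rng):
    """Return the first action of the best simulated random-action trajectory."""
    best_return, best_first_action = float("-inf"), None
    for _ in range(n_trajectories):
        plan = [rng.choice(actions) for _ in range(horizon)]
        s, total = state, 0.0
        for a in plan:
            s = model_step(s, a, rng)  # one simulated step of the system model
            total += reward_fn(s)      # assumption: score = sum of rewards
        if total > best_return:
            best_return, best_first_action = total, plan[0]
    return best_first_action


def toy_model_step(s, a, rng):
    """Hypothetical 1-D model: the state drifts by 0.1 * action, no noise."""
    return s + 0.1 * a


def toy_reward(s):
    """Hypothetical reward: stay close to zero."""
    return -abs(s)


action = random_shooting(
    state=1.0,
    model_step=toy_model_step,
    reward_fn=toy_reward,
    actions=[-1, 0, 1],
    horizon=3,
    n_trajectories=500,
    rng=random.Random(0),
)
```

In the loop on the slide, only `action` would be executed on the real system; planning is then repeated from the next observed state, and the model is retrained after each episode.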
- 22. Acrobot "raw angles": DARMDN with random shooting is the new SOTA
  - Almost as good as planning using the real system dynamics
  - Converges 2 to 4 times faster than previous SOTA
- 23. Acrobot "sincos"
- 24. Learnt policy after ~10k samples
- 25. Deterministic predictors: do we really need to represent uncertainty?
  - $p(y \mid \boldsymbol{x}) = \mathrm{Dirac}(f(\boldsymbol{x}; \boldsymbol{\theta}))$
  - What models?
    - NNdet: classical neural net
    - DARMDN(10)det: mean of the predictive posterior, i.e. a deterministic model learned probabilistically
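Turning a mixture into a deterministic predictor by taking the mean of the predictive distribution can be sketched as below, assuming Gaussian components. The bimodal parameters are made up to show one consequence: when the predictive distribution is multimodal, its mean can land in a region the model itself considers very unlikely.

```python
import math


def mixture_mean(weights, mus):
    """Mean of a mixture: the weighted average of the component means."""
    return sum(w * m for w, m in zip(weights, mus))


def mixture_pdf(y, weights, mus, sigmas):
    """Density of a univariate Gaussian mixture at y."""
    return sum(
        w * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, mus, sigmas)
    )


# Made-up bimodal predictive distribution with modes at -3 and +3.
weights, mus, sigmas = [0.5, 0.5], [-3.0, 3.0], [0.2, 0.2]

point_prediction = mixture_mean(weights, mus)
density_at_prediction = mixture_pdf(point_prediction, weights, mus, sigmas)
density_at_mode = mixture_pdf(3.0, weights, mus, sigmas)
```

Here the point prediction sits between the two modes, where the density is far lower than at either mode. For a unimodal predictive distribution the mean and the mode coincide, so this effect does not arise.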
- 26. Acrobot "raw angle": no surprise, deterministic models are suboptimal
- 28. Acrobot "sincos": what? The deterministic model is optimal, but only if learned probabilistically
- 30. Is it heteroscedasticity or multimodality?
- 31. It is heteroscedasticity
- 32. Broader applications of DARMDN
  - Model-based control, bandits, and reinforcement learning
    - Learn to control the system in a sample-efficient way: "real world will not become faster in a few years, contrary to computers" [Chatzilygeroudis et al., 2019]
    - State of the art suffers from the lack of efficient system modelling tools
    - Modelling uncertainties is crucial for safety
  - Bayesian optimization
    - Requires good and efficient models to quantify uncertainty due to unknown
  - Transfer learning, meta-learning, and robust reinforcement learning
    - Precise probabilistic system models make it possible to transfer models between systems of the same kind
  - Anomaly detection
    - Anomaly = system state is beyond "likely" behavior
- 33. Conclusions
  - Deep autoregressive mixture density nets (DARMDN) + random shooting is the new SOTA on Acrobot
  - Autoregression is useful for modelling y-interdependence
  - Multimodal posterior predictive is necessary on the "raw angles" representation
  - Deterministic DARMDN is as good as stochastic models on the "sincos" representation, and beats the NN model trained with a deterministic (RMSE) loss
    - Something happens in the long horizon, no error accumulation
    - Perhaps heteroscedastic epistemic uncertainty models may "let outliers go"?
- 34. Thank you. Copyright © 2015 Huawei Technologies Co., Ltd. All Rights Reserved.