DeepLearning summer school, Gran Canaria 2021
https://www.youtube.com/watch?v=B5UBJiwL_D0
I introduce reinforcement learning from a model-based perspective. In this paradigm the core of the algorithm is the system model: a multivariate generative (probabilistic) time-series predictor. The system model is combined with online planning and/or with model-free agents learned on the system model. The course is designed for students with basic classical machine learning knowledge. My goal is to open up an interesting perspective while also giving you useful tools to tackle practical applications.
The main motivation of the course is learning and improving policies to control engineering systems (autopilots or "self-driving" systems). Unlike popular game benchmarks, these systems are low-dimensional (~10s to ~100s of dimensions), with rewards coming continuously or with a short delay. On the other hand, they are physical, slow, and system access is usually extremely restricted. The focus and the main algorithmic challenge is thus not representation learning and handling sparse rewards (as in games), but rather learning robust system models on extremely small (100s or 1000s of time steps) non-iid data. The perspective is also an interesting extension of the classical supervised learning paradigm, in which functions are learned in a single shot on data generated (sampled and labeled) by an imaginary oracle. In the real world, supervised models are usually re-learned often, on non-iid data generated by a process which we partially control. The questions that we will ponder (exploration, distribution shift, non-iid data) may thus also interest students planning to work on supervised machine learning in the real world.
Model-based reinforcement learning and self-driving engineering systems
1. HUAWEI TECHNOLOGIES CO., LTD.
www.huawei.com
Introduction to model-based reinforcement learning
Towards self-driving engineering systems
Balazs Kegl, Noah's Ark Research Lab, Paris
Joint work with Albert Thomas, Gabriel Hurtado, and Othman Gaizi
2. HUAWEI TECHNOLOGIES CO., LTD. Page 2
AI research veteran (25 years)
› recently crossing over from academic research to industry
In the last 5 years at CNRS I became interested in the human aspects of AI
tech transfer
› Within the scientific world: getting machine learning pipelines into sciences (astrophysics,
medical sciences, climate sciences, economy, etc.)
› It turned out that the management and organizational issues are very similar in industry
› The ultimate question: what should we work on?
Leading a team of 15 at Huawei Noah's Ark Lab in Paris
› Research Scientists, Research Engineers, PhD students
› Partly doing AI research, partly solving business unit (BU) problems
Who am I?
https://www.linkedin.com/in/balazs.kegl
https://twitter.com/balazs.kegl
https://balazskegl.medium.com/
3. HUAWEI TECHNOLOGIES CO., LTD. Page 3
Noah's Ark Paris team
Composition
› 8 Permanent researchers: Balazs Kegl (lead), Merwan Barlier,
Chunchun Yang, Igor Colin, Ludovic Dos Santos, Albert Thomas,
Aladin Virmaux, Cedric Malherbe
› 3 research engineers: Illyyne Saffar, Gabriel Hurtado, Martin Tabikh
› 3 PhD students: George Dasoulas, Geovani Rizk, Paul Daoudi
Expertise
› machine learning, optimization, reinforcement learning, deep
learning, distributed and multi-agent algorithms, robust ML, graph
theory, AutoML, transfer learning
Growth
› 7 in Jan. 2018 to 11 in Nov. 2019 to 13 in 2020 to 18 in 2021
5. HUAWEI TECHNOLOGIES CO., LTD. Page 5
The concept of interpretation is all here: there is no experience of truth
that is not interpretative. I do not know anything that does not interest me.
If it does interest me, it is evident that I do not look at it in a noninterested
way.
Gianni Vattimo: After the Death of God (talking about Heidegger)
My dream: move AI from a propositional (function learning) paradigm towards
a procedural (goal-oriented) paradigm that incorporates data collection
My day job: self-driving engineering systems
Also: supervised learning is embedded in a frequent re-training/tuning loop
basically in all successful industrial ML pipelines
6. HUAWEI TECHNOLOGIES CO., LTD. Page 6
The big questions
How does AI generate value?
What problems should we solve?
› Most AI research is improving solutions on well-defined problems
How to make sure that the solutions are useful within the organizational and management constraints?
› Derive the problems from the imagined workflow in which the solution will be used
› Note that this is non-technical expertise; we also need organizational experts
› https://towardsdatascience.com/how-to-build-a-data-science-pipeline-f24341848045
› https://towardsdatascience.com/how-to-build-a-data-science-pipeline-f24341848045
7. HUAWEI TECHNOLOGIES CO., LTD. Page 7
Meta
Not a usual tutorial
› No breadth
› Rather a historical walk through our research process (~2 years)
› No theory (math, bounds), only intuitions (based on solid theoretical ground)
› Rather a mix of engineering and experimental scientific methodology to optimize and to learn
» Identify the problem to solve
» Look around for solutions
» Design solutions
» Design well-controlled experiments to understand properties of the solutions
Q&A, discussion format is the zeitgeist
› There is no stupid question: if you don't understand something, chances are that half of the class
doesn't either
10. HUAWEI TECHNOLOGIES CO., LTD. Page 10
A typical engineering control system
[Diagram: the Engineer sends actions a_t to the System; the System returns observables o_t and rewards r_t]
The engineer observes system states and performance indicators, tunes some parameters from time to time, to optimize the performance indicators
11. HUAWEI TECHNOLOGIES CO., LTD. Page 11
Engineering systems = ~$10s of trillions per year
12. HUAWEI TECHNOLOGIES CO., LTD. Page 12
Our use cases
Autopilots for engineering
systems
› Data center cooling
› Wireless parameter tuning
› Wi-Fi setup
Making them
› Safer, better, more reliable, more
energy efficient
We believe these are only the
tip of the iceberg
13. HUAWEI TECHNOLOGIES CO., LTD. Page 13
Automated control, if it exists, is based on a deep understanding of the physics of the system.
16. HUAWEI TECHNOLOGIES CO., LTD. Page 16
What is AI (in this context)?
Learn the system behavior
based on historical data
and use it for better control
17. HUAWEI TECHNOLOGIES CO., LTD. Page 17
SE:“I would like you to land AI to control my engineering system.”
DS: “Ok, can I access your system with an algorithm which takes control of
the system, possibly breaking it sometimes in order to learn?”
SE: “Over my dead body.”
A typical conversation between data scientists
(DS) and BU systems engineer (SE)
18. HUAWEI TECHNOLOGIES CO., LTD. Page 18
DS: “OK, do you have a simulator which I can use to learn a control policy?”
SE: “We are working on it. But in any case, it will never be good enough to
be trusted.”
A typical conversation between data scientists
(DS) and BU systems engineer (SE)
19. HUAWEI TECHNOLOGIES CO., LTD. Page 19
DS: “Can you execute a new control policy, after thorough checking and with human safeguards, from time to time, and log the system variables and KPIs?”
SE: “Maybe.”
A typical conversation between data scientists
(DS) and BU systems engineer (SE)
20. HUAWEI TECHNOLOGIES CO., LTD. Page 20
The systems engineer thinks in classical tech transfer project management
terms
› Systems engineer specifies a problem
› Researcher solves it and delivers technology
The data science process requires R&D iteration
› Systems engineer specifies a problem
› Data scientist describes what data/simulator/system she needs
› They design tools to provide/annotate data and interfaces to AI algorithms
› Data scientist designs algorithms, pipelines, experiments, metrics
› They iterate
What has just happened?
21. HUAWEI TECHNOLOGIES CO., LTD. Page 21
Controlled engineering system:
organizational constraints
Offline (batch): system traces (logs)
Micro-data: physical systems, high-quality logging is not a priority
Safety: we cannot "lose" while learning
22. HUAWEI TECHNOLOGIES CO., LTD. Page 22
"real world will not become faster in a few years,
contrary to computers"
24. HUAWEI TECHNOLOGIES CO., LTD. Page 24
Iterated offline/batch RL
Realistic:
› Fits the organizational scenario we can hope to implement
› Technically doable
› Not well-studied in research (cf. the trillion-dollar market)
25. HUAWEI TECHNOLOGIES CO., LTD. Page 25
Model-based offline RL
Why?
› Considered the best approach for the micro-data regime
› We do not waste predictive power (unlike, e.g., on images)
› System models (simulators) are useful on their own
› Self-supervision in RL
26. HUAWEI TECHNOLOGIES CO., LTD. Page 26
Model-free offline RL
Why?
› Better asymptotic performance (a goal to aim at with MBRL)
› Better researched, good baselines
› MBRL planners (called "Dyna-style") are essentially model-free algorithms
27. HUAWEI TECHNOLOGIES CO., LTD. Page 27
Contextual bandits / Bayesopt (zero order)
Why?
› Rewards at every step, short delay
28. HUAWEI TECHNOLOGIES CO., LTD. Page 28
Models for dynamic systems
› Which models to choose and based on what criteria?
› Separating epistemic and aleatory uncertainties: Can we verify? How to do it?
› Heteroscedasticity at training time proved to be crucial. Why?
› Causality/action sensitivity: building models leading to better treatment effect estimation
› Summarizing history (context): prior knowledge, attention.
› Distribution shift, transfer learning.
› Data check, online or offline, "fear" reaction (unknown behavior).
Model-free reinforcement learning
› Which model-free or planning agents to choose on system models?
» Robustness to covariate shift
» Criteria to choose
› Best model-free offline RL algorithms, especially in terms of sample complexity.
› Which are the best contextual bandit/bayesopt algorithms?
› How to explore in the "slow" iterated offline setup.
Safety
› How to formulate and enforce safety?
› When learning and when deploying the learned agent
› How to set the desired safety level flexibly?
› How to add safety to the exploration policy?
Multi-agent control
› Multiple non-interacting systems, sharing their experience.
› Transferring the learned model and agent from one system to another.
› Interaction between the systems and the control agents.
› Optimizing multi-system rewards in a fair way.
Policy evaluation and AutoML
› Toolbox, easy to use by novice data scientist or system engineer.
› Policy evaluation to select and tune models.
› Towards automating the process that learns the autopilot.
Research themes (3-4 year plan)
https://balazskegl.medium.com/building-autopilots-for-engineering-systems-using-ai-86a4f312c1f2
[Slide annotation: team members associated with the themes above — Albert, Balazs, Othman, Gabriel; Igor, Ludo, Merwan, Albert, Alexandre, Geovani; Ludo, Merwan, Paul; Merwan, Ludo, Igor]
29. HUAWEI TECHNOLOGIES CO., LTD. Page 29
Subject of this course (drawn from the research themes above)
https://balazskegl.medium.com/building-autopilots-for-engineering-systems-using-ai-86a4f312c1f2
31. HUAWEI TECHNOLOGIES CO., LTD. Page 31
Observables 𝒐
› ~10-100 dimensional, both internal (depend on actions) and external
› Mixed continuous, discrete, categorical; bounded or not
Actions 𝒂
› ~1-100 dimensional
› Mixed continuous, discrete, categorical
Rewards (called KPIs) 𝒓
› 1-10 dimensional, usually r = f(o), continuous, short delay
› Multi-dimensional constraints (safety) and targets
History
› Chunks of length 1000 - 100000
› Missing sensors and time steps
Typical use case
32. HUAWEI TECHNOLOGIES CO., LTD. Page 32
"real world will not become faster in a few years,
contrary to computers"
33. HUAWEI TECHNOLOGIES CO., LTD. Page 33
Micro-data model-based RL needs
reliable and scalable
system models
34. HUAWEI TECHNOLOGIES CO., LTD. Page 34
System model
=
multi-output
probabilistic (generative)
time series forecaster
35. HUAWEI TECHNOLOGIES CO., LTD. Page 35
Generative time-series predictors
› Sample efficient: can be learned on a couple of thousand time steps
› Introspective and well-calibrated: honest about their own uncertainty
› Self-tuning and/or robust, from 100 to 100000 training points
Control and exploration using system models
› Basic model predictive control (random shooting)
› Active sampling and exploration
› Learn the control agent
Landing
› Diagnostics and debugging tools usable by engineers
Research program
37. HUAWEI TECHNOLOGIES CO., LTD. Page 37
Predict (random) future from history of system observables and control actions:
o_{t+1} ~ p(o_{t+1} | (o_1, a_1), …, (o_t, a_t)),
where the target is y = o_{t+1} and the input is x = ((o_1, a_1), …, (o_t, a_t))
› We want to simulate
multiple futures from the model
System model = multi-output time series forecaster
[Figure: past observations up to the present, with several simulated futures and the ground-truth future]
38. HUAWEI TECHNOLOGIES CO., LTD. Page 38
System model = multi-output time series forecaster
39. HUAWEI TECHNOLOGIES CO., LTD. Page 39
Generative regression: predict y ~ p(y | x) instead of y = f(x)
› Predictors that are honest about their uncertainty: introspective models
Requirements
› Both 𝒙 and 𝒚 are multidimensional
› Training should scale well with the dimension of 𝒙 and 𝒚 and the size of the training data
› Easy to compute likelihood
› Easy to sample (simulate)
› Able to model y-interdependence
› Able to model different types of variables
› Frequent semi-automatic retraining and retuning: robustness and debuggability
Objective
o_{t+1} ~ p(o_{t+1} | (o_1, a_1), …, (o_t, a_t)),   y = o_{t+1},   x = ((o_1, a_1), …, (o_t, a_t))
40. HUAWEI TECHNOLOGIES CO., LTD. Page 40
What model?
› Deterministic predictor + fixed-sigma Gaussian
› (Conditional) Gaussian (mixture)
› autoregressive NNs and forests
› VAE
› GAN
› Flow models
Scientific questions I
41. HUAWEI TECHNOLOGIES CO., LTD. Page 41
What are the important properties?
› Deterministic (classical predictors): y ~ Dirac(y | x), i.e. y = f(x)
› Probabilistic: y ~ p(y | x)
» Homoscedastic (variance does not depend on the input): y ~ N(y | f(x), σ)
» Heteroscedastic (sigma does depend on the input; see the sketch after this slide)
– Unimodal: y ~ N(y | f(x), σ(x))
– Multimodal: y ~ Σ_{ℓ=1}^{L} w_ℓ(x) 𝒫_ℓ(y; θ_ℓ(x))
» y-interdependent (being able to model the (inter)dependence of components of y given x)
Scientific questions II
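To make the heteroscedastic case concrete, here is a minimal numpy sketch of the Gaussian negative log-likelihood with an input-dependent sigma. This is illustrative only (not the training code behind the talk): it shows why such a loss rewards the model for reporting large uncertainty exactly where its mean prediction is unreliable, unlike a plain MSE with a fixed sigma.

import numpy as np

def heteroscedastic_gaussian_nll(y, mu, log_sigma):
    # negative log-likelihood of y ~ N(mu(x), sigma(x)); mu and log_sigma are
    # the per-example outputs of the model
    sigma2 = np.exp(2.0 * log_sigma)
    return np.mean(0.5 * ((y - mu) ** 2 / sigma2 + np.log(2.0 * np.pi * sigma2)))

# toy check: the same squared error is penalized less when sigma is larger
y, mu = np.array([1.0]), np.array([0.0])
print(heteroscedastic_gaussian_nll(y, mu, np.log(np.array([0.5]))))  # confident and wrong: high loss
print(heteroscedastic_gaussian_nll(y, mu, np.log(np.array([2.0]))))  # uncertain and wrong: lower loss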
42. HUAWEI TECHNOLOGIES CO., LTD. Page 42
What is y-interdependence and why it may be important?
[Figure: predicted (sin θ, cos θ) samples from GP, DMDN(5), and DARMDN(1) models]
43. HUAWEI TECHNOLOGIES CO., LTD. Page 43
What is y-interdependence and why it may be important?
45. HUAWEI TECHNOLOGIES CO., LTD. Page 45
What is the probability of the world ending if I press this button?
46. HUAWEI TECHNOLOGIES CO., LTD. Page 46
Why generative?
Besides point forecasts, predictors should also predict their uncertainty.
Uncertainties are important for decision making: should I plan an outdoor
event?
› Instead of
“tomorrow’s max temperature is 26 degrees, it will be sunny”,
say that
“tomorrow’s max temperature is 26 degrees ±3 degrees, 10% chance of rain”.
Generative time series forecasting
47. HUAWEI TECHNOLOGIES CO., LTD. Page 47
Why generative?
Besides point forecasts, predictors should also predict their uncertainty.
› We need to simulate from the forecasting models, for model-based control and optimization.
When the forecast is consumed by a control or optimization module, uncertainty can be
propagated through the deterministic optimizer or planner by executing it on several random
simulated traces (“futures”). This is especially important when safety is at stake, since we need
to model tail (extreme) event probabilities.
Epistemic vs aleatory uncertainty
Generative time series forecasting
48. HUAWEI TECHNOLOGIES CO., LTD. Page 48
Approximation capacity in system modelling
› We want to be able to represent the real system dynamics efficiently
› We also want to have realistic representation of uncertainty ("plausible futures") to support
exploration
"Raw angles" acrobot
› Normally angles are transformed using sine and cosine to make the system dynamics smooth
› What if we are agnostic? We do not know if a system variable is an angle
› Abrupt jumps are OK, but if we have (epistemic) uncertainty, posteriors need to be multimodal
Is multi-modal posterior predictive important?
49. HUAWEI TECHNOLOGIES CO., LTD. Page 49
What to do with a good system model?
› Plug it into a planning algorithm - no learning (beyond learning the system model)
› Learn an agent on the model and send it back to the real system ("Dyna-style")
» Exploration (iterative batch!): a bad model and a bad agent can be stuck while seeming to have converged
» Planning: we may just want to use the agent to guide the planning algorithm, not directly on
the real system
– When choosing the actions in the rollouts
– Bootstrapping the learned value at the last step (instead of just summing up the rewards)
Scientific questions III
52. HUAWEI TECHNOLOGIES CO., LTD. Page 52
› Both are based on experiments
› George Stephenson: make sure the locomotive works, then optimize
› Carnot: understand the principles of thermodynamics, theorize,
design experiments to (in)validate hypotheses
› We need to publish: religion of the SOTA
› We also want to study the properties of the best approach
› Strategy: go straight ahead to optimize, then come back and check
rigorously what really matters (ablation)
› Let's start optimizing the model with a simple planning algorithm,
then move on to smart agents
› Business cases are out of reach for exhaustive experimentation, we
first need to learn to master our algorithms on toy benchmarks
Engineering or experimental
scientific approach?
53. HUAWEI TECHNOLOGIES CO., LTD. Page 53
Which system(s) or env(s)?
The broad approach
› Good overview, huge work, and very useful!
› Helped us to choose a single env to start with
› Lacks in-depth understanding of individual envs and
hyperparameter optimization (what do we learn other
than which method works on which env?)
54. HUAWEI TECHNOLOGIES CO., LTD. Page 54
Which system(s) or env(s)?
Our deep approach
› Choose a single env, understand and optimize it, reach
SOTA beyond doubt
› We chose Acrobot
» Relatively simple but non-trivial: we could learn good system models on a couple of thousand training points
» Good model + simple planning is SOTA
» Previous SOTA happened to be very suboptimal
› Generalizability is in question: do our findings extend
to other envs?
55. HUAWEI TECHNOLOGIES CO., LTD. Page 55
The benchmark system: Acrobot
System observables: o = (θ̇₂, θ₂, θ̇₁, θ₁) (joint angular velocities and angles)
Actions: torque at the second joint, a ∈ {left, none, right}
Reward: height of the tip of the lower segment (0: hanging position, 2: ceiling, 4: top position)
Raw angles system: o = (θ̇₂, θ₂, θ̇₁, θ₁), with jumps at ±π
Sincos system: o = (sin θ₂, cos θ₂, θ̇₂, sin θ₁, cos θ₁, θ̇₁), with y-interdependence (see the sketch below)
[Figure: the Acrobot, a two-link pendulum with joint angles θ₁ and θ₂]
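A minimal sketch of the two observation encodings, following the ordering on this slide (the function names and the ordering convention are illustrative assumptions, not the actual benchmark code):

import numpy as np

def raw_angle_obs(theta1, theta2, theta1_dot, theta2_dot):
    # raw-angle encoding: angles wrapped to [-pi, pi), so trajectories jump at ±pi
    wrap = lambda th: (th + np.pi) % (2 * np.pi) - np.pi
    return np.array([theta2_dot, wrap(theta2), theta1_dot, wrap(theta1)])

def sincos_obs(theta1, theta2, theta1_dot, theta2_dot):
    # sincos encoding: smooth dynamics, but sin and cos of the same angle are
    # strongly y-interdependent (they must stay on the unit circle)
    return np.array([np.sin(theta2), np.cos(theta2), theta2_dot,
                     np.sin(theta1), np.cos(theta1), theta1_dot])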
56. HUAWEI TECHNOLOGIES CO., LTD. Page 56
Can we learn a precise system model from data?
p(o_{t+1} | (o_1, a_1), …, (o_t, a_t)) = p(o_{t+1} | o_t, a_t)
57. HUAWEI TECHNOLOGIES CO., LTD. Page 57
Yes we can!
Which one is the physical model and which one is AI?
You can vote in the chat window: AI is left or right?
https://youtu.be/FHFz2ERB4eA
58. HUAWEI TECHNOLOGIES CO., LTD. Page 58
Let's jump ahead:
what do we do if we have a model?
Remember that our goal is
small sample complexity:
use system access steps as efficiently as possible
59. HUAWEI TECHNOLOGIES CO., LTD. Page 59
1. Collect samples from a random policy
2. Train model on collected samples
3. Learn (or just apply) control policy on the model
4. Apply control policy on real system and collect the data, go back to 2.
Model-based RL loop
(iterative batch)
We retrain the model after each episode of 200 steps
Control policy is classical random shooting (RS) [Richards 2005]
› Simulate 𝑛 trajectories of ℎ steps using random actions
› Select the optimal trajectory (with the highest reward after ℎ steps)
› Execute the first action of the optimal trajectory
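A minimal sketch of this iterated-batch loop with random-shooting control. It assumes a gym-style environment (reset/step) and a system model with fit(transitions) and sample_next(o, a) methods; these interface names, and the helper functions, are illustrative assumptions, not the actual code behind the talk.

import numpy as np

def rollout(env, policy, steps):
    # run `policy` for one episode on the real system and log the transitions
    transitions, o = [], env.reset()
    for _ in range(steps):
        a = policy(o)
        o_next, r, *_ = env.step(a)
        transitions.append((o, a, r, o_next))
        o = o_next
    return transitions

def random_shooting(model, reward_fn, o, actions, n=100, h=10, rng=None):
    # simulate n random action sequences of h steps on the learned model and
    # return the first action of the sequence with the best final reward
    rng = rng or np.random.default_rng()
    best_a, best_r = None, -np.inf
    for _ in range(n):
        seq = rng.choice(len(actions), size=h)
        o_sim = o
        for idx in seq:
            o_sim = model.sample_next(o_sim, actions[idx])  # generative one-step prediction
        r = reward_fn(o_sim)
        if r > best_r:
            best_r, best_a = r, actions[seq[0]]
    return best_a

def iterated_batch_mbrl(env, model, reward_fn, actions, episodes=20, steps=200, rng=None):
    rng = rng or np.random.default_rng()
    # 1. collect samples with a random policy
    data = rollout(env, lambda o: actions[rng.integers(len(actions))], steps)
    for _ in range(episodes):
        model.fit(data)                          # 2. retrain the model after each episode
        controller = lambda o: random_shooting(model, reward_fn, o, actions, rng=rng)
        data += rollout(env, controller, steps)  # 3.-4. apply the policy on the real system, log, repeat
    return model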
60. HUAWEI TECHNOLOGIES CO., LTD. Page 60
https://youtu.be/fgwQGTXgI1M
› Random policy,
mean reward = 0.1 (can go up to 0.5, halfway to the length of the lower link)
https://youtu.be/X-qTJP5U78Q
› Suboptimal policy stuck below the horizon,
mean reward = 1.56
https://youtu.be/Rwrf7-46aUE
› A good policy that, until recently, we thought was impossible to beat in a 200-step episode,
mean reward = 2.01
https://youtu.be/XxiTVqxSS1o
› Currently optimal policy that stabilizes the Acrobot within the 200-step episode,
mean reward = 2.56
Acrobot is a non-trivial system
61. HUAWEI TECHNOLOGIES CO., LTD. Page 61
Acrobot is a non-trivial system
63. HUAWEI TECHNOLOGIES CO., LTD. Page 63
We want high reward fast, "dynamic" metrics
› Unlike supervised learning, RL has no simply decipherable metrics
» Total reward depends on env, scale, number of steps
› Reliability: error bars (across episodes and seeds)
› (R)MAR: (relative) mean asymptotic reward (average reward after convergence)
› MRCP(70): mean reward convergence pace
We want to train, tune, and compare models on "static" metrics
› That matter for dynamic performance
› Time series regression metrics: MSE and R2
› Generative metrics: likelihood, (calibratedness), and (outlier ratio)
› Long horizon metrics: R2(h)
Metrics
64. HUAWEI TECHNOLOGIES CO., LTD. Page 64
Dynamic metrics
RMAR: relative mean asymptotic reward, the mean reward over the convergent phase rescaled so that 0 is the mean reward of the random policy and 1 is the mean reward of random shooting with h=10, n=100
MRCP(70): mean reward convergence pace, measured in system access steps (∞ if convergence is not reached)
[Figure: reward learning curves with transient and convergent phases; example values RMAR = 0.54 ± 0.03, 1.23 ± 0.01, and 0.7; MRCP(70) = 1200 system access steps and ∞]
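A sketch of one plausible reading of these two dynamic metrics, computed from a curve of per-episode mean rewards; the exact definitions (in particular the smoothing and the 70% threshold) are in the accompanying paper, so treat this as illustrative.

import numpy as np

def rmar(mean_rewards, r_random, r_baseline, convergent_from):
    # relative mean asymptotic reward: average over the convergent phase,
    # rescaled so that the random policy scores 0 and the random-shooting
    # baseline (h=10, n=100) scores 1
    mar = np.mean(mean_rewards[convergent_from:])
    return (mar - r_random) / (r_baseline - r_random)

def mrcp(mean_rewards, steps_per_episode, level=0.70):
    # mean reward convergence pace: system access steps needed to first reach
    # `level` times the asymptotic mean reward (inf if never reached)
    target = level * np.mean(mean_rewards[-len(mean_rewards) // 2:])
    for episode, r in enumerate(mean_rewards):
        if r >= target:
            return (episode + 1) * steps_per_episode
    return np.inf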
65. HUAWEI TECHNOLOGIES CO., LTD. Page 65
› ℒ_b is the log-likelihood of a multivariate unconditional spherical Gaussian baseline
› Measures how much more likely the data is under the learned model than under the baseline
› Baseline = 1, the higher the better, no upper limit
Static metrics
Likelihood ratio to simple baseline
LR({(o_t, a_t)}_{t=1}^T ; p) = e^{ℒ({(o_t, a_t)}_{t=1}^T ; p)} / e^{ℒ_b({(o_t, a_t)}_{t=1}^T)}
Log-likelihood
ℒ({(o_t, a_t)}_{t=1}^T ; p) = 1/(T−1) Σ_{t=1}^{T−1} log p(o_{t+1} | o_t, a_t)
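A numpy sketch of these two static metrics. The model interface (log_prob) and the choice to fit the spherical Gaussian baseline to the next observables are assumptions made for illustration.

import numpy as np

def log_likelihood(model, trace):
    # mean one-step log-likelihood over a trace of (o_t, a_t, o_{t+1}) transitions
    return np.mean([model.log_prob(o_next, o, a) for o, a, o_next in trace])

def baseline_log_likelihood(trace):
    # L_b: unconditional spherical Gaussian (shared variance) fitted to the next observables
    targets = np.array([o_next for _, _, o_next in trace])
    mu, var = targets.mean(axis=0), targets.var()
    d = targets.shape[1]
    ll = -0.5 * (np.sum((targets - mu) ** 2, axis=1) / var + d * np.log(2 * np.pi * var))
    return np.mean(ll)

def likelihood_ratio(model, trace):
    # LR = exp(L) / exp(L_b): values above 1 mean the model beats the naive baseline
    return np.exp(log_likelihood(model, trace) - baseline_log_likelihood(trace))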
67. HUAWEI TECHNOLOGIES CO., LTD. Page 67
Long horizon metrics
› Models predict o_{t+1} directly, but can be cascaded: o_{t+2} = f(f(o_t))
› Likelihood would need convolution, but R2(h) can be computed using Monte Carlo
› We found that R2(10) correlates the best with dynamic performance
Static metrics
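A Monte-Carlo sketch of R2(h): cascade the one-step generative model h steps, average the sampled futures, and score them against the ground truth with the classical R2. The sample_next interface and the per-dimension averaging are assumptions; the exact definition is in the accompanying paper.

import numpy as np

def r2_at_horizon(model, trace, h=10, n_samples=30, rng=None):
    rng = rng or np.random.default_rng()
    obs = np.array([o for o, _, _ in trace] + [trace[-1][2]])
    acts = [a for _, a, _ in trace]
    preds, truth = [], []
    for t in range(len(trace) - h):
        samples = []
        for _ in range(n_samples):
            o_sim = obs[t]
            for k in range(h):
                o_sim = model.sample_next(o_sim, acts[t + k])  # cascaded generative rollout
            samples.append(o_sim)
        preds.append(np.mean(samples, axis=0))
        truth.append(obs[t + h])
    preds, truth = np.array(preds), np.array(truth)
    ss_res = np.sum((truth - preds) ** 2, axis=0)
    ss_tot = np.sum((truth - truth.mean(axis=0)) ** 2, axis=0)
    return np.mean(1.0 - ss_res / ss_tot)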
70. HUAWEI TECHNOLOGIES CO., LTD. Page 70
Autoregression: p(y | x) = p_1(y_1 | x) ∏_{j=2}^{d} p_j(y_j | y_1, …, y_{j−1}, x)   (see the sampling sketch below)
› Fighting the curse of dimensionality:
» We reduce the d-dimensional model into d one-dimensional models
› We can tune the models separately:
» unlike e.g. images, system logs may have varying column types
› Modelling y-interdependence: p(y_1 | x) and p(y_2 | x) can be strongly dependent in physical systems
Mixture model: p(y | x) = Σ_{ℓ=1}^{L} w_ℓ(x) 𝒫_ℓ(y; θ_ℓ(x))
› Simple: easy to compute likelihood, easy to simulate from
› Versatile: can use prior knowledge (component type), can approximate any density
Why the decompositions?
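How the autoregressive decomposition is used at simulation time, as a minimal sketch: the d one-dimensional conditional models are chained in a fixed order, each conditioned on x and on the already-sampled components. The conditionals[j].sample interface is an assumption for illustration.

import numpy as np

def sample_autoregressive(conditionals, x, rng=None):
    # sample y ~ p(y|x) = p_1(y_1|x) * prod_j p_j(y_j | y_1..y_{j-1}, x);
    # each conditionals[j] is a one-dimensional generative model
    rng = rng or np.random.default_rng()
    y = []
    for model_j in conditionals:
        inputs = np.concatenate([x, np.array(y)])  # condition on x and the components sampled so far
        y.append(model_j.sample(inputs, rng=rng))
    return np.array(y)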
72. HUAWEI TECHNOLOGIES CO., LTD. Page 72
Any regressor + fixed sigma: p(y | x) = N(y; f(x; θ), σ)
› Linear regression (ARLinσ)
› Classical neural nets (DARNNσ)
We learn the parameters (w(x) and θ(x)) with a deep neural net:
deep autoregressive mixture density nets = DARMDN ("darm-dee-en"; see the sketch below)
› DARMDN(1) with a single Gaussian component: heteroscedastic p(y | x) = N(y; μ(x), σ(x))
› DARMDN(10) allows for multi-modality
› PETS [Chua et al. 2018]: ensembled DARMDN(1)
Non-autoregressive models
› Gaussian process
› DMDN(10): classical mixture density nets with multivariate Gaussian components [Bishop 1994]
› Both assume y-independence
› VAE, flow (RealNVP), GAN
How do we learn the model?
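A numpy sketch of the one-dimensional mixture density output head: given the per-input network outputs (component logits, means, log-sigmas), it computes the negative log-likelihood used for training and simulates a sample. This is a generic MDN illustration, not the DARMDN implementation itself.

import numpy as np

def mdn_nll(y, logits, mu, log_sigma):
    # negative log-likelihood of p(y|x) = sum_l w_l(x) N(y; mu_l(x), sigma_l(x));
    # logits, mu, log_sigma have shape (L,) for one input x
    log_w = logits - np.logaddexp.reduce(logits)  # log softmax
    log_comp = -0.5 * ((y - mu) ** 2 / np.exp(2 * log_sigma)
                       + np.log(2 * np.pi) + 2 * log_sigma)
    return -np.logaddexp.reduce(log_w + log_comp)

def mdn_sample(logits, mu, log_sigma, rng=None):
    # simulate: pick a component by its weight, then draw from that Gaussian
    rng = rng or np.random.default_rng()
    w = np.exp(logits - np.logaddexp.reduce(logits))
    l = rng.choice(len(w), p=w)
    return rng.normal(mu[l], np.exp(log_sigma[l]))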
73. HUAWEI TECHNOLOGIES CO., LTD. Page 73
Deterministic models
› When we shoot in random shooting (using the model to simulate futures), we can choose between
simulating from the mean or drawing from the conditional density
› DARNNdet , DARMDN(1)det , DARMDN(10)det , DMDN(10)det , PETSdet
How do we learn the model?
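For the deterministic (*det) variants, one possible reading of "simulating from the mean" is to roll out with the conditional mean of the mixture instead of drawing a sample, as in this small sketch (an assumption for illustration):

import numpy as np

def mixture_mean(logits, mu):
    # deterministic simulation mode: E[y|x] = sum_l w_l(x) mu_l(x)
    w = np.exp(logits - np.logaddexp.reduce(logits))
    return float(np.sum(w * mu))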
78. HUAWEI TECHNOLOGIES CO., LTD. Page 78
Scientific questions III
› We know that we can achieve the optimal policy with a longer horizon and more simulation
› 1. Can we simply learn an agent on the
model and deploy it on the real system?
› The two tricks of AlphaGo: is it possible with fewer simulations and a shorter horizon if
» 2. the planning (search) is not random
but guided by a smart agent?
» 3. the estimated reward is not the
reward at the final step but the value
estimate of the smart agent?
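A sketch of random shooting with these two tricks: rollout actions are proposed by a learned agent, and each rollout is scored by the agent's value estimate at the last simulated step. The policy, value_fn, and model interfaces are assumptions for illustration, not the actual code behind the results.

import numpy as np

def guided_shooting(model, policy, value_fn, o, n=20, h=5, rng=None):
    rng = rng or np.random.default_rng()
    best_a, best_score = None, -np.inf
    for _ in range(n):
        o_sim, first_a = o, None
        for k in range(h):
            a = policy.sample_action(o_sim, rng=rng)  # trick 2: smart, stochastic action proposals
            first_a = a if k == 0 else first_a
            o_sim = model.sample_next(o_sim, a)
        score = value_fn(o_sim)                       # trick 3: bootstrap with the learned value estimate
        if score > best_score:
            best_score, best_a = score, first_a
    return best_a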
79. HUAWEI TECHNOLOGIES CO., LTD. Page 79
Can we learn a smart agent on the model and deploy it in the real system? NO
80. HUAWEI TECHNOLOGIES CO., LTD. Page 80
Can we assist the planning with a smart agent?
YES
81. HUAWEI TECHNOLOGIES CO., LTD. Page 81
Mixture density nets are optimal and versatile, especially the autoregressive
type
A multimodal generative model may be needed, depending on the env
A deterministic model is slightly better if multimodality is not needed
Heteroscedasticity is useful even when we use the deterministic mean
at simulation time!
y-interdependence does not seem to matter
Smart agents + planning + exploration beat both smart agents alone and random-shooting planning
Conclusions