DeepLearning summer school, Gran Canaria 2021
https://www.youtube.com/watch?v=B5UBJiwL_D0
I introduce reinforcement learning from a model-based perspective. In this paradigm the core of the algorithm is the system model: a multivariate generative (probabilistic) time-series predictor. The system model is combined with online planning and/or with model-free agents learned on the system model. The course is designed for students with basic classical machine learning knowledge. My goal is to open up an interesting perspective while also giving you useful tools to tackle practical applications.
The main motivation of the course is learning and improving policies to control engineering systems (autopilots or "self-driving" systems). Unlike popular game benchmarks, these systems are low-dimensional (~10s to ~100s of dimensions), with rewards coming continuously or with a short delay. On the other hand, they are physical, slow, and system access is usually extremely restricted. The focus and the main algorithmic challenge is thus not representation learning and handling sparse rewards (as in games), but rather learning robust system models on extremely small (100s or 1000s of time steps) non-iid data. The perspective is also an interesting extension of the classical supervised learning paradigm, in which functions are learned in a single shot on data generated (sampled and labeled) by an imaginary oracle. In the real world, supervised models are usually re-learned often, on non-iid data generated by a process which we partially control. The questions that we will ponder (exploration, distribution shift, non-iid data) may thus also interest students planning to work on supervised machine learning in the real world.
Model-based reinforcement learning and self-driving engineering systems
1. HUAWEI TECHNOLOGIES CO., LTD.
www.huawei.com
Introduction to model-based reinforcement learning
Towards self-driving engineering systems
Balazs Kegl, Noah's Ark Research Lab, Paris
Joint work with Albert Thomas, Gabriel Hurtado, and Othman Gaizi
2. HUAWEI TECHNOLOGIES CO., LTD. Page 2
AI research veteran (25 years)
› recently crossing over from academic research to industry
In the last 5 years at CNRS I became interested in the human aspects of AI
tech transfer
› Within the scientific world: getting machine learning pipelines into sciences (astrophysics,
medical sciences, climate sciences, economy, etc.)
› It turned out that the management and organizational issues are very similar in industry
› The ultimate question: what should we work on?
Leading a team of 15 at Huawei Noah's Ark Lab in Paris
› Research Scientists, Research Engineers, PhD students
› Partly doing AI research, partly solving business unit (BU) problems
Who am I?
https://www.linkedin.com/in/balazs.kegl
https://twitter.com/balazs.kegl
https://balazskegl.medium.com/
3. HUAWEI TECHNOLOGIES CO., LTD. Page 3
Noah's Ark Paris team
Composition
› 8 Permanent researchers: Balazs Kegl (lead), Merwan Barlier,
Chunchun Yang, Igor Colin, Ludovic Dos Santos, Albert Thomas,
Aladin Virmaux, Cedric Malherbe
› 3 research engineers: Illyyne Saffar, Gabriel Hurtado, Martin Tabikh
› 3 PhD students: George Dasoulas, Geovani Rizk, Paul Daoudi
Expertise
› machine learning, optimization, reinforcement learning, deep
learning, distributed and multi-agent algorithms, robust ML, graph
theory, AutoML, transfer learning
Growth
› 7 in Jan. 2018 to 11 in Nov. 2019 to 13 in 2020 to 18 in 2021
5. HUAWEI TECHNOLOGIES CO., LTD. Page 5
The concept of interpretation is all here: there is no experience of truth
that is not interpretative. I do not know anything that does not interest me.
If it does interest me, it is evident that I do not look at it in a noninterested
way.
Gianni Vattimo: After the Death of God (talking about Heidegger)
My dream: move AI from a propositional (function learning) paradigm towards
a procedural (goal-oriented) paradigm that incorporates data collection
My day job: self-driving engineering systems
Also: supervised learning is embedded in a frequent re-training/tuning loop
basically in all successful industrial ML pipelines
6. HUAWEI TECHNOLOGIES CO., LTD. Page 6
The big questions
How does AI generate value?
What problems should we solve?
› Most AI research is improving solutions on well-defined problems
How to make sure that the solutions are useful within the organizational and management constraints?
› Derive the problems from the imagined workflow in which the solution will be used
› Note that this is non-technical expertise; we also need organizational experts
› https://towardsdatascience.com/how-to-build-a-data-science-pipeline-f24341848045
› https://towardsdatascience.com/how-to-build-a-data-science-pipeline-f24341848045
7. HUAWEI TECHNOLOGIES CO., LTD. Page 7
Meta
Not a usual tutorial
› No breadth
› Rather a historical walk through our research process (~2 years)
› No theory (math, bounds), only intuitions (based on solid theoretical ground)
› Rather a mix of engineering and experimental scientific methodology to optimize and to learn
» Identify the problem to solve
» Look around for solutions
» Design solutions
» Design well-controlled experiments to understand properties of the solutions
Q&A, discussion format is the zeitgeist
› There is no stupid question: if you don't understand something, chances are that half of the class
doesn't either
10. HUAWEI TECHNOLOGIES CO., LTD. Page 10
A typical engineering control system
[Diagram: the Engineer sends actions a_t to the System; the System returns observables o_t and rewards r_t]
The engineer observes system states and performance indicators, tunes some parameters from time to time, to optimize the performance indicators
11. HUAWEI TECHNOLOGIES CO., LTD. Page 11
Engineering systems = ~$10s of trillions per year
12. HUAWEI TECHNOLOGIES CO., LTD. Page 12
Our use cases
Autopilots for engineering
systems
› Data center cooling
› Wireless parameter tuning
› Wi-Fi setup
Making them
› Safer, better, more reliable, more
energy efficient
We believe these are only the
tip of the iceberg
13. HUAWEI TECHNOLOGIES CO., LTD. Page 13
Automated control, if it exists, is based on a deep understanding of the physics of the system.
16. HUAWEI TECHNOLOGIES CO., LTD. Page 16
What is AI (in this context)?
Learn the system behavior
based on historical data
and use it for better control
17. HUAWEI TECHNOLOGIES CO., LTD. Page 17
SE:“I would like you to land AI to control my engineering system.”
DS: “Ok, can I access your system with an algorithm which takes control of
the system, possibly breaking it sometimes in order to learn?”
SE: “Over my dead body.”
A typical conversation between data scientists
(DS) and BU systems engineer (SE)
18. HUAWEI TECHNOLOGIES CO., LTD. Page 18
DS: “OK, do you have a simulator which I can use to learn a control policy?”
SE: “We are working on it. But in any case, it will never be good enough to
be trusted.”
A typical conversation between data scientists
(DS) and BU systems engineer (SE)
19. HUAWEI TECHNOLOGIES CO., LTD. Page 19
DS: “Can you execute a new control policy, after thorough checking and with human safeguards, from time to time, and log the system variables and KPIs?”
SE: “Maybe.”
A typical conversation between data scientists
(DS) and BU systems engineer (SE)
20. HUAWEI TECHNOLOGIES CO., LTD. Page 20
The systems engineer thinks in classical tech transfer project management
terms
› Systems engineer specifies a problem
› Researcher solves it and delivers technology
The data science process requires R&D iteration
› Systems engineer specifies a problem
› Data scientist describes what data/simulator/system she needs
› They design tools to provide/annotate data and interfaces to AI algorithms
› Data scientist designs algorithms, pipelines, experiments, metrics
› They iterate
What has just happened?
21. HUAWEI TECHNOLOGIES CO., LTD. Page 21
Controlled engineering system:
organizational constraints
Offline (batch): system traces (logs)
Micro-data: physical systems, high-quality logging is not a priority
Safety: we cannot "lose" while learning
22. HUAWEI TECHNOLOGIES CO., LTD. Page 22
"real world will not become faster in a few years,
contrary to computers"
24. HUAWEI TECHNOLOGIES CO., LTD. Page 24
Iterated offline/batch RL
Realistic:
› Fits the organizational scenario we can hope to implement
› Technically doable
› Not well-studied in research (cf. the trillion-dollar market)
25. HUAWEI TECHNOLOGIES CO., LTD. Page 25
Model-based offline RL
Why?
› Considered the best approach for the micro-data regime
› We do not waste predictive power (unlike, e.g., on images)
› System models (simulators) are useful on their own
› Self-supervision in RL
26. HUAWEI TECHNOLOGIES CO., LTD. Page 26
Model-free offline RL
Why?
› Better asymptotic performance (a goal to aim at with MBRL)
› Better researched, good baselines
› MBRL planners (called "Dyna-style") are essentially model-free algorithms
27. HUAWEI TECHNOLOGIES CO., LTD. Page 27
Contextual bandits / Bayesopt (zero order)
Why?
› Rewards at every step, short delay
28. HUAWEI TECHNOLOGIES CO., LTD. Page 28
Models for dynamic systems
› Which models to choose and based on what criteria?
› Separating epistemic and aleatory uncertainties: Can we verify? How to do it?
› Heteroscedasticity at training time proved to be crucial. Why?
› Causality/action sensitivity: building models leading to better treatment effect estimation
› Summarizing history (context): prior knowledge, attention.
› Distribution shift, transfer learning.
› Data check, online or offline, "fear" reaction (unknown behavior).
Model-free reinforcement learning
› Which model-free or planning agents to choose on system models?
» Robustness to covariate shift
» Criteria to choose
› Best model-free offline RL algorithms, especially in terms of sample complexity.
› Which are the best contextual bandit/bayesopt algorithms?
› How to explore in the "slow" iterated offline setup.
Safety
› How to formulate and enforce safety?
› When learning and when deploying the learned agent
› How to set the desired safety level flexibly?
› How to add safety to the exploration policy?
Multi-agent control
› Multiple non-interacting systems, sharing their experience.
› Transferring the learned model and agent from one system to another.
› Interaction between the systems and the control agents.
› Optimizing multi-system rewards in a fair way.
Policy evaluation and AutoML
› Toolbox, easy to use by novice data scientist or system engineer.
› Policy evaluation to select and tune models.
› Towards automating the process that learns the autopilot.
Research themes (3-4 year plan)
https://balazskegl.medium.com/building-autopilots-for-engineering-systems-using-ai-86a4f312c1f2
[Slide annotation: team members associated with the themes above — Albert, Balazs, Othman, Gabriel; Igor, Ludo, Merwan, Albert, Alexandre, Geovani; Ludo, Merwan, Paul; Merwan, Ludo, Igor]
29. HUAWEI TECHNOLOGIES CO., LTD. Page 29
Subject of this course (drawn from the research themes above)
https://balazskegl.medium.com/building-autopilots-for-engineering-systems-using-ai-86a4f312c1f2
31. HUAWEI TECHNOLOGIES CO., LTD. Page 31
Observables 𝒐
› ~10-100 dimensional, both internal (depend on actions) and external
› Mixed continuous, discrete, categorical; bounded or not
Actions 𝒂
› ~1-100 dimensional
› Mixed continuous, discrete, categorical
Rewards (called KPIs) 𝒓
› 1-10 dimensional, usually r = f(o), continuous, short delay
› Multi-dimensional constraints (safety) and targets
History
› Chunks of length 1000 - 100000
› Missing sensors and time steps
Typical use case
32. HUAWEI TECHNOLOGIES CO., LTD. Page 32
"real world will not become faster in a few years,
contrary to computers"
33. HUAWEI TECHNOLOGIES CO., LTD. Page 33
Micro-data model-based RL needs
reliable and scalable
system models
34. HUAWEI TECHNOLOGIES CO., LTD. Page 34
System model
=
multi-output
probabilistic (generative)
time series forecaster
35. HUAWEI TECHNOLOGIES CO., LTD. Page 35
Generative time-series predictors
› Sample efficient: can be learned on a couple of thousand time steps
› Introspective and well-calibrated: honest about their own uncertainty
› Self-tuning and/or robust, from 100 to 100000 training points
Control and exploration using system models
› Basic model predictive control (random shooting)
› Active sampling and exploration
› Learn the control agent
Landing
› Diagnostics and debugging tools usable by engineers
Research program
37. HUAWEI TECHNOLOGIES CO., LTD. Page 37
Predict (random) future from history of system observables and control actions:
o_{t+1} ~ p(o_{t+1} | (o_1, a_1), …, (o_t, a_t)),
where the target is y = o_{t+1} and the input is x = ((o_1, a_1), …, (o_t, a_t))
› We want to simulate
multiple futures from the model
System model = multi-output time series forecaster
[Figure: past observations up to the present, with several simulated futures and the ground-truth future]
38. HUAWEI TECHNOLOGIES CO., LTD. Page 38
System model = multi-output time series forecaster
39. HUAWEI TECHNOLOGIES CO., LTD. Page 39
Generative regression: predict y ~ p(y | x) instead of y = f(x)
› Predictors that are honest about their uncertainty: introspective models
Requirements
› Both 𝒙 and 𝒚 are multidimensional
› Training should scale well with the dimension of 𝒙 and 𝒚 and the size of the training data
› Easy to compute likelihood
› Easy to sample (simulate)
› Able to model y-interdependence
› Able to model different types of variables
› Frequent semi-automatic retraining and retuning: robustness and debuggability
Objective
o_{t+1} ~ p(o_{t+1} | (o_1, a_1), …, (o_t, a_t)),   y = o_{t+1},   x = ((o_1, a_1), …, (o_t, a_t))
40. HUAWEI TECHNOLOGIES CO., LTD. Page 40
What model?
› Deterministic predictor + fixed-sigma Gaussian
› (Conditional) Gaussian (mixture)
› autoregressive NNs and forests
› VAE
› GAN
› Flow models
Scientific questions I
41. HUAWEI TECHNOLOGIES CO., LTD. Page 41
What are the important properties?
› Deterministic (classical predictors): y ~ Dirac(y | x), i.e. y = f(x)
› Probabilistic: y ~ p(y | x)
» Homoscedastic (variance does not depend on the input): y ~ N(y | f(x), σ)
» Heteroscedastic (sigma does depend on the input; see the sketch after this slide)
– Unimodal: y ~ N(y | f(x), σ(x))
– Multimodal: y ~ Σ_{ℓ=1}^{L} w_ℓ(x) 𝒫_ℓ(y; θ_ℓ(x))
» y-interdependent (being able to model the (inter)dependence of components of y given x)
Scientific questions II
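To make the heteroscedastic case concrete, here is a minimal numpy sketch of the Gaussian negative log-likelihood with an input-dependent sigma. This is illustrative only (not the training code behind the talk): it shows why such a loss rewards the model for reporting large uncertainty exactly where its mean prediction is unreliable, unlike a plain MSE with a fixed sigma.

import numpy as np

def heteroscedastic_gaussian_nll(y, mu, log_sigma):
    # negative log-likelihood of y ~ N(mu(x), sigma(x)); mu and log_sigma are
    # the per-example outputs of the model
    sigma2 = np.exp(2.0 * log_sigma)
    return np.mean(0.5 * ((y - mu) ** 2 / sigma2 + np.log(2.0 * np.pi * sigma2)))

# toy check: the same squared error is penalized less when sigma is larger
y, mu = np.array([1.0]), np.array([0.0])
print(heteroscedastic_gaussian_nll(y, mu, np.log(np.array([0.5]))))  # confident and wrong: high loss
print(heteroscedastic_gaussian_nll(y, mu, np.log(np.array([2.0]))))  # uncertain and wrong: lower loss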
42. HUAWEI TECHNOLOGIES CO., LTD. Page 42
What is y-interdependence and why it may be important?
[Figure: predicted (sin θ, cos θ) samples from GP, DMDN(5), and DARMDN(1) models]
43. HUAWEI TECHNOLOGIES CO., LTD. Page 43
What is y-interdependence and why it may be important?
45. HUAWEI TECHNOLOGIES CO., LTD. Page 45
What is the probability of the world ending if I press this button?
46. HUAWEI TECHNOLOGIES CO., LTD. Page 46
Why generative?
Besides point forecasts, predictors should also predict their uncertainty.
Uncertainties are important for decision making: should I plan an outdoor
event?
› Instead of
“tomorrow’s max temperature is 26 degrees, it will be sunny”,
say that
“tomorrow’s max temperature is 26 degrees ±3 degrees, 10% chance of rain”.
Generative time series forecasting
47. HUAWEI TECHNOLOGIES CO., LTD. Page 47
Why generative?
Besides point forecasts, predictors should also predict their uncertainty.
› We need to simulate from the forecasting models, for model-based control and optimization.
When the forecast is consumed by a control or optimization module, uncertainty can be
propagated through the deterministic optimizer or planner by executing it on several random
simulated traces (“futures”). This is especially important when safety is at stake, since we need
to model tail (extreme) event probabilities.
Epistemic vs aleatory uncertainty
Generative time series forecasting
48. HUAWEI TECHNOLOGIES CO., LTD. Page 48
Approximation capacity in system modelling
› We want to be able to represent the real system dynamics efficiently
› We also want to have realistic representation of uncertainty ("plausible futures") to support
exploration
"Raw angles" acrobot
› Normally angles are transformed using sine and cosine to make the system dynamics smooth
› What if we are agnostic? We do not know if a system variable is an angle
› Abrupt jumps are OK, but if we have (epistemic) uncertainty, posteriors need to be multimodal
Is multi-modal posterior predictive important?
49. HUAWEI TECHNOLOGIES CO., LTD. Page 49
What to do with a good system model?
› Plug it into a planning algorithm - no learning (beyond learning the system model)
› Learn an agent on the model and send it back to the real system ("Dyna-style")
» Exploration (iterative batch!): a bad model and a bad agent can be stuck while seeming to have converged
» Planning: we may just want to use the agent to guide the planning algorithm, not directly on
the real system
– When choosing the actions in the rollouts
– Bootstrapping the learned value at the last step (instead of just summing up the rewards)
Scientific questions III
52. HUAWEI TECHNOLOGIES CO., LTD. Page 52
› Both are based on experiments
› George Stephenson: make sure the locomotive works, then optimize
› Carnot: understand the principles of thermodynamics, theorize,
design experiments to (in)validate hypotheses
› We need to publish: religion of the SOTA
› We also want to study the properties of the best approach
› Strategy: go straight ahead to optimize, then come back and check
rigorously what really matters (ablation)
› Let's start optimizing the model with a simple planning algorithm,
then move on to smart agents
› Business cases are out of reach for exhaustive experimentation, we
first need to learn to master our algorithms on toy benchmarks
Engineering or experimental
scientific approach?
53. HUAWEI TECHNOLOGIES CO., LTD. Page 53
Which system(s) or env(s)?
The broad approach
› Good overview, huge work, and very useful!
› Helped us to choose a single env to start with
› Lacks in-depth understanding of individual envs and
hyperparameter optimization (what do we learn other
than which method works on which env?)
54. HUAWEI TECHNOLOGIES CO., LTD. Page 54
Which system(s) or env(s)?
Our deep approach
› Choose a single env, understand and optimize it, reach
SOTA beyond doubt
› We chose Acrobot
» Relatively simple but non-trivial: we could learn good system models on a couple of thousand training points
» Good model + simple planning is SOTA
» Previous SOTA happened to be very suboptimal
› Generalizability is in question: do our findings extend
to other envs?
55. HUAWEI TECHNOLOGIES CO., LTD. Page 55
The benchmark system: Acrobot
System observables: o = (θ̇₂, θ₂, θ̇₁, θ₁) (joint angular velocities and angles)
Actions: torque at the second joint, a ∈ {left, none, right}
Reward: height of the tip of the lower segment (0: hanging position, 2: ceiling, 4: top position)
Raw angles system: o = (θ̇₂, θ₂, θ̇₁, θ₁), with jumps at ±π
Sincos system: o = (sin θ₂, cos θ₂, θ̇₂, sin θ₁, cos θ₁, θ̇₁), with y-interdependence (see the sketch below)
[Figure: the Acrobot, a two-link pendulum with joint angles θ₁ and θ₂]
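A minimal sketch of the two observation encodings, following the ordering on this slide (the function names and the ordering convention are illustrative assumptions, not the actual benchmark code):

import numpy as np

def raw_angle_obs(theta1, theta2, theta1_dot, theta2_dot):
    # raw-angle encoding: angles wrapped to [-pi, pi), so trajectories jump at ±pi
    wrap = lambda th: (th + np.pi) % (2 * np.pi) - np.pi
    return np.array([theta2_dot, wrap(theta2), theta1_dot, wrap(theta1)])

def sincos_obs(theta1, theta2, theta1_dot, theta2_dot):
    # sincos encoding: smooth dynamics, but sin and cos of the same angle are
    # strongly y-interdependent (they must stay on the unit circle)
    return np.array([np.sin(theta2), np.cos(theta2), theta2_dot,
                     np.sin(theta1), np.cos(theta1), theta1_dot])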
56. HUAWEI TECHNOLOGIES CO., LTD. Page 56
Can we learn a precise system model from data?
p(o_{t+1} | (o_1, a_1), …, (o_t, a_t)) = p(o_{t+1} | o_t, a_t)
57. HUAWEI TECHNOLOGIES CO., LTD. Page 57
Yes we can!
Which one is the physical model and which one is AI?
You can vote in the chat window: AI is left or right?
https://youtu.be/FHFz2ERB4eA
58. HUAWEI TECHNOLOGIES CO., LTD. Page 58
Let's jump ahead:
what do we do if we have a model?
Remember that our goal is
small sample complexity:
use system access steps as efficiently as possible
59. HUAWEI TECHNOLOGIES CO., LTD. Page 59
1. Collect samples from a random policy
2. Train model on collected samples
3. Learn (or just apply) control policy on the model
4. Apply control policy on real system and collect the data, go back to 2.
Model-based RL loop
(iterative batch)
We retrain the model after each episode of 200 steps
Control policy is classical random shooting (RS) [Richards 2005]
› Simulate 𝑛 trajectories of ℎ steps using random actions
› Select the optimal trajectory (with the highest reward after ℎ steps)
› Execute the first action of the optimal trajectory
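A minimal sketch of this iterated-batch loop with random-shooting control. It assumes a gym-style environment (reset/step) and a system model with fit(transitions) and sample_next(o, a) methods; these interface names, and the helper functions, are illustrative assumptions, not the actual code behind the talk.

import numpy as np

def rollout(env, policy, steps):
    # run `policy` for one episode on the real system and log the transitions
    transitions, o = [], env.reset()
    for _ in range(steps):
        a = policy(o)
        o_next, r, *_ = env.step(a)
        transitions.append((o, a, r, o_next))
        o = o_next
    return transitions

def random_shooting(model, reward_fn, o, actions, n=100, h=10, rng=None):
    # simulate n random action sequences of h steps on the learned model and
    # return the first action of the sequence with the best final reward
    rng = rng or np.random.default_rng()
    best_a, best_r = None, -np.inf
    for _ in range(n):
        seq = rng.choice(len(actions), size=h)
        o_sim = o
        for idx in seq:
            o_sim = model.sample_next(o_sim, actions[idx])  # generative one-step prediction
        r = reward_fn(o_sim)
        if r > best_r:
            best_r, best_a = r, actions[seq[0]]
    return best_a

def iterated_batch_mbrl(env, model, reward_fn, actions, episodes=20, steps=200, rng=None):
    rng = rng or np.random.default_rng()
    # 1. collect samples with a random policy
    data = rollout(env, lambda o: actions[rng.integers(len(actions))], steps)
    for _ in range(episodes):
        model.fit(data)                          # 2. retrain the model after each episode
        controller = lambda o: random_shooting(model, reward_fn, o, actions, rng=rng)
        data += rollout(env, controller, steps)  # 3.-4. apply the policy on the real system, log, repeat
    return model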
60. HUAWEI TECHNOLOGIES CO., LTD. Page 60
https://youtu.be/fgwQGTXgI1M
› Random policy,
mean reward = 0.1 (can go up to 0.5, halfway to the length of the lower link)
https://youtu.be/X-qTJP5U78Q
› Suboptimal policy stuck below the horizon,
mean reward = 1.56
https://youtu.be/Rwrf7-46aUE
› A good policy that, until recently, we thought was impossible to beat in a 200-step episode,
mean reward = 2.01
https://youtu.be/XxiTVqxSS1o
› Currently optimal policy that stabilizes the Acrobot within the 200-step episode,
mean reward = 2.56
Acrobot is a non-trivial system
61. HUAWEI TECHNOLOGIES CO., LTD. Page 61
Acrobot is a non-trivial system
63. HUAWEI TECHNOLOGIES CO., LTD. Page 63
We want high reward fast, "dynamic" metrics
› Unlike supervised learning, RL has no simply decipherable metrics
» Total reward depends on env, scale, number of steps
› Reliability: error bars (across episodes and seeds)
› (R)MAR: (relative) mean asymptotic reward (average reward after convergence)
› MRCP(70): mean reward convergence pace
We want to train, tune, and compare models on "static" metrics
› That matter for dynamic performance
› Time series regression metrics: MSE and R2
› Generative metrics: likelihood, (calibratedness), and (outlier ratio)
› Long horizon metrics: R2(h)
Metrics
64. HUAWEI TECHNOLOGIES CO., LTD. Page 64
Dynamic metrics
RMAR: relative mean asymptotic reward, the mean reward over the convergent phase rescaled so that 0 is the mean reward of the random policy and 1 is the mean reward of random shooting with h=10, n=100
MRCP(70): mean reward convergence pace, measured in system access steps (∞ if convergence is not reached)
[Figure: reward learning curves with transient and convergent phases; example values RMAR = 0.54 ± 0.03, 1.23 ± 0.01, and 0.7; MRCP(70) = 1200 system access steps and ∞]
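A sketch of one plausible reading of these two dynamic metrics, computed from a curve of per-episode mean rewards; the exact definitions (in particular the smoothing and the 70% threshold) are in the accompanying paper, so treat this as illustrative.

import numpy as np

def rmar(mean_rewards, r_random, r_baseline, convergent_from):
    # relative mean asymptotic reward: average over the convergent phase,
    # rescaled so that the random policy scores 0 and the random-shooting
    # baseline (h=10, n=100) scores 1
    mar = np.mean(mean_rewards[convergent_from:])
    return (mar - r_random) / (r_baseline - r_random)

def mrcp(mean_rewards, steps_per_episode, level=0.70):
    # mean reward convergence pace: system access steps needed to first reach
    # `level` times the asymptotic mean reward (inf if never reached)
    target = level * np.mean(mean_rewards[-len(mean_rewards) // 2:])
    for episode, r in enumerate(mean_rewards):
        if r >= target:
            return (episode + 1) * steps_per_episode
    return np.inf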
65. HUAWEI TECHNOLOGIES CO., LTD. Page 65
› ℒ_b is the log-likelihood of a multivariate unconditional spherical Gaussian baseline
› Measures how much more likely the data is under the learned model than under the baseline
› Baseline = 1, the higher the better, no upper limit
Static metrics
Likelihood ratio to simple baseline
LR({(o_t, a_t)}_{t=1}^T ; p) = e^{ℒ({(o_t, a_t)}_{t=1}^T ; p)} / e^{ℒ_b({(o_t, a_t)}_{t=1}^T)}
Log-likelihood
ℒ({(o_t, a_t)}_{t=1}^T ; p) = 1/(T−1) Σ_{t=1}^{T−1} log p(o_{t+1} | o_t, a_t)
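A numpy sketch of these two static metrics. The model interface (log_prob) and the choice to fit the spherical Gaussian baseline to the next observables are assumptions made for illustration.

import numpy as np

def log_likelihood(model, trace):
    # mean one-step log-likelihood over a trace of (o_t, a_t, o_{t+1}) transitions
    return np.mean([model.log_prob(o_next, o, a) for o, a, o_next in trace])

def baseline_log_likelihood(trace):
    # L_b: unconditional spherical Gaussian (shared variance) fitted to the next observables
    targets = np.array([o_next for _, _, o_next in trace])
    mu, var = targets.mean(axis=0), targets.var()
    d = targets.shape[1]
    ll = -0.5 * (np.sum((targets - mu) ** 2, axis=1) / var + d * np.log(2 * np.pi * var))
    return np.mean(ll)

def likelihood_ratio(model, trace):
    # LR = exp(L) / exp(L_b): values above 1 mean the model beats the naive baseline
    return np.exp(log_likelihood(model, trace) - baseline_log_likelihood(trace))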
67. HUAWEI TECHNOLOGIES CO., LTD. Page 67
Long horizon metrics
› Models predict o_{t+1} directly, but can be cascaded: o_{t+2} = f(f(o_t))
› Likelihood would need convolution, but R2(h) can be computed using Monte Carlo
› We found that R2(10) correlates the best with dynamic performance
Static metrics
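A Monte-Carlo sketch of R2(h): cascade the one-step generative model h steps, average the sampled futures, and score them against the ground truth with the classical R2. The sample_next interface and the per-dimension averaging are assumptions; the exact definition is in the accompanying paper.

import numpy as np

def r2_at_horizon(model, trace, h=10, n_samples=30, rng=None):
    rng = rng or np.random.default_rng()
    obs = np.array([o for o, _, _ in trace] + [trace[-1][2]])
    acts = [a for _, a, _ in trace]
    preds, truth = [], []
    for t in range(len(trace) - h):
        samples = []
        for _ in range(n_samples):
            o_sim = obs[t]
            for k in range(h):
                o_sim = model.sample_next(o_sim, acts[t + k])  # cascaded generative rollout
            samples.append(o_sim)
        preds.append(np.mean(samples, axis=0))
        truth.append(obs[t + h])
    preds, truth = np.array(preds), np.array(truth)
    ss_res = np.sum((truth - preds) ** 2, axis=0)
    ss_tot = np.sum((truth - truth.mean(axis=0)) ** 2, axis=0)
    return np.mean(1.0 - ss_res / ss_tot)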
70. HUAWEI TECHNOLOGIES CO., LTD. Page 70
Autoregression: p(y | x) = p_1(y_1 | x) ∏_{j=2}^{d} p_j(y_j | y_1, …, y_{j−1}, x)   (see the sampling sketch below)
› Fighting the curse of dimensionality:
» We reduce the d-dimensional model into d one-dimensional models
› We can tune the models separately:
» unlike e.g. images, system logs may have varying column types
› Modelling y-interdependence: p(y_1 | x) and p(y_2 | x) can be strongly dependent in physical systems
Mixture model: p(y | x) = Σ_{ℓ=1}^{L} w_ℓ(x) 𝒫_ℓ(y; θ_ℓ(x))
› Simple: easy to compute likelihood, easy to simulate from
› Versatile: can use prior knowledge (component type), can approximate any density
Why the decompositions?
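How the autoregressive decomposition is used at simulation time, as a minimal sketch: the d one-dimensional conditional models are chained in a fixed order, each conditioned on x and on the already-sampled components. The conditionals[j].sample interface is an assumption for illustration.

import numpy as np

def sample_autoregressive(conditionals, x, rng=None):
    # sample y ~ p(y|x) = p_1(y_1|x) * prod_j p_j(y_j | y_1..y_{j-1}, x);
    # each conditionals[j] is a one-dimensional generative model
    rng = rng or np.random.default_rng()
    y = []
    for model_j in conditionals:
        inputs = np.concatenate([x, np.array(y)])  # condition on x and the components sampled so far
        y.append(model_j.sample(inputs, rng=rng))
    return np.array(y)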
72. HUAWEI TECHNOLOGIES CO., LTD. Page 72
Any regressor + fixed sigma: p(y | x) = N(y; f(x; θ), σ)
› Linear regression (ARLinσ)
› Classical neural nets (DARNNσ)
We learn the parameters (w(x) and θ(x)) with a deep neural net:
deep autoregressive mixture density nets = DARMDN ("darm-dee-en"; see the sketch below)
› DARMDN(1) with a single Gaussian component: heteroscedastic p(y | x) = N(y; μ(x), σ(x))
› DARMDN(10) allows for multi-modality
› PETS [Chua et al. 2018]: ensembled DARMDN(1)
Non-autoregressive models
› Gaussian process
› DMDN(10): classical mixture density nets with multivariate Gaussian components [Bishop 1994]
› Both assume y-independence
› VAE, flow (RealNVP), GAN
How do we learn the model?
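A numpy sketch of the one-dimensional mixture density output head: given the per-input network outputs (component logits, means, log-sigmas), it computes the negative log-likelihood used for training and simulates a sample. This is a generic MDN illustration, not the DARMDN implementation itself.

import numpy as np

def mdn_nll(y, logits, mu, log_sigma):
    # negative log-likelihood of p(y|x) = sum_l w_l(x) N(y; mu_l(x), sigma_l(x));
    # logits, mu, log_sigma have shape (L,) for one input x
    log_w = logits - np.logaddexp.reduce(logits)  # log softmax
    log_comp = -0.5 * ((y - mu) ** 2 / np.exp(2 * log_sigma)
                       + np.log(2 * np.pi) + 2 * log_sigma)
    return -np.logaddexp.reduce(log_w + log_comp)

def mdn_sample(logits, mu, log_sigma, rng=None):
    # simulate: pick a component by its weight, then draw from that Gaussian
    rng = rng or np.random.default_rng()
    w = np.exp(logits - np.logaddexp.reduce(logits))
    l = rng.choice(len(w), p=w)
    return rng.normal(mu[l], np.exp(log_sigma[l]))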
73. HUAWEI TECHNOLOGIES CO., LTD. Page 73
Deterministic models
› When we shoot in random shooting (using the model to simulate futures), we can choose between
simulating from the mean or drawing from the conditional density
› DARNNdet , DARMDN(1)det , DARMDN(10)det , DMDN(10)det , PETSdet
How do we learn the model?
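For the deterministic (*det) variants, one possible reading of "simulating from the mean" is to roll out with the conditional mean of the mixture instead of drawing a sample, as in this small sketch (an assumption for illustration):

import numpy as np

def mixture_mean(logits, mu):
    # deterministic simulation mode: E[y|x] = sum_l w_l(x) mu_l(x)
    w = np.exp(logits - np.logaddexp.reduce(logits))
    return float(np.sum(w * mu))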
78. HUAWEI TECHNOLOGIES CO., LTD. Page 78
Scientific questions III
› We know that we can achieve the optimal policy with a longer horizon and more simulation
› 1. Can we simply learn an agent on the
model and deploy it on the real system?
› The two tricks of AlphaGo: is it possible with fewer simulations and a shorter horizon if
» 2. the planning (search) is not random
but guided by a smart agent?
» 3. the estimated reward is not the
reward at the final step but the value
estimate of the smart agent?
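A sketch of random shooting with these two tricks: rollout actions are proposed by a learned agent, and each rollout is scored by the agent's value estimate at the last simulated step. The policy, value_fn, and model interfaces are assumptions for illustration, not the actual code behind the results.

import numpy as np

def guided_shooting(model, policy, value_fn, o, n=20, h=5, rng=None):
    rng = rng or np.random.default_rng()
    best_a, best_score = None, -np.inf
    for _ in range(n):
        o_sim, first_a = o, None
        for k in range(h):
            a = policy.sample_action(o_sim, rng=rng)  # trick 2: smart, stochastic action proposals
            first_a = a if k == 0 else first_a
            o_sim = model.sample_next(o_sim, a)
        score = value_fn(o_sim)                       # trick 3: bootstrap with the learned value estimate
        if score > best_score:
            best_score, best_a = score, first_a
    return best_a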
79. HUAWEI TECHNOLOGIES CO., LTD. Page 79
Can we learn a smart agent on the model and deploy it in the real system? NO
80. HUAWEI TECHNOLOGIES CO., LTD. Page 80
Can we assist the planning with a smart agent?
YES
81. HUAWEI TECHNOLOGIES CO., LTD. Page 81
Mixture density nets are optimal and versatile, especially the autoregressive
type
A multimodal generative model may be needed, depending on the env
A deterministic model is slightly better if multimodality is not needed
Heteroscedasticity is useful even when we use the deterministic mean
at simulation time!
y-interdependence does not seem to matter
Smart agents + planning + exploration beat both smart agents alone and random-shooting planning
Conclusions