WORLD MODELS
Presentation By Duane Nielsen And Pushkar Merwah
World models
Background: Reinforcement Learning
World Model Architecture
 View
 Model
 Controller
Experiment 1 – OpenAI Gym CarRacing-v0
 Ablation Study
Experiment 2 – VizDoom
 Training Policy From “World Model Dream”
Basics of Reinforcement Learning
What is reinforcement learning?
Learning the expert representation required to achieve an objective, given a success metric, from raw inputs and in the absence of domain knowledge.
This gives it a unique advantage over supervised learning: the environment does not need to be static, and the agent learns to make the kinds of decisions intelligent beings face, such as weighing the value of new knowledge (exploration vs. exploitation).
Key concepts in reinforcement learning
Planning an optimal way to achieve an objective
Learning the value of being in a state
Learning a policy for how to act in a given state
Learning to act in an environment indirectly, by experiencing a learned representation (model) of that environment
General policy iteration
Policy evaluation: evaluate the current policy to get the state values
Policy improvement: act greedily on those values to improve the policy
Alternating evaluation and improvement converges to an optimal policy
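To make the evaluate-then-improve loop concrete, here is a minimal tabular sketch in Python (the tiny 2-state MDP and all numbers are invented purely for illustration):

# Minimal general policy iteration on a made-up 2-state, 2-action MDP.
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# P[(s, a)] = (next_state, reward): a toy deterministic transition table.
P = {(0, 0): (0, 0.0), (0, 1): (1, 1.0),
     (1, 0): (0, 0.0), (1, 1): (1, 2.0)}

policy = np.zeros(n_states, dtype=int)        # start from an arbitrary policy
for _ in range(20):
    # Policy evaluation: iterate the Bellman equation for the current policy.
    V = np.zeros(n_states)
    for _ in range(100):
        V = np.array([P[(s, policy[s])][1] + gamma * V[P[(s, policy[s])][0]]
                      for s in range(n_states)])
    # Policy improvement: act greedily with respect to the state values.
    policy = np.array([np.argmax([P[(s, a)][1] + gamma * V[P[(s, a)][0]]
                                  for a in range(n_actions)])
                       for s in range(n_states)])

print(policy, V)   # converges to always choosing action 1 (the rewarding self-loop)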
Markov decision process
The future is independent of the past given the present.
(state, action, reward, discount, exploration)
[Diagram: agent/controller and environment/system interact in a loop. At a given state, Agent() takes an action; Env() returns a reward and the next state.]
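As a code sketch of this loop, here is the agent-environment interaction using the classic OpenAI Gym API (the same interface used in Experiment 1); the random action below is a placeholder for a real agent, and the snippet assumes the pre-0.26 Gym reset/step signatures:

# At a given state the agent takes an action; the environment returns a reward and the next state.
import gym

env = gym.make("CarRacing-v0")
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()            # placeholder agent: random policy
    state, reward, done, info = env.step(action)  # env returns reward and new state
    total_reward += reward
env.close()
print("episode return:", total_reward)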
Interacting Between Planning, Acting And Learning
[Diagram: value/policy drives acting, which produces experience; experience updates the value/policy directly (direct RL) and also feeds model learning; the learned model supports planning, which updates the value/policy as well.]
Interacting Between Planning, Acting And Learning
[Same diagram as above, annotated "You Are Here": World Models learns a model of the environment and uses it for planning and training.]
CONTRIBUTIONS
Models the environment using an unsupervised, low-dimensional representation.
A recurrent mixture density model captures the effect of the agent's actions on the environment stochastically, which helps the controller anticipate the next move.
RL trains a policy inside the learned model, and that policy transfers to the “real” environment.
WORLD MODELS
D. Ha and J. Schmidhuber
World Model Architecture
• V – VIEW
• M – MODEL
• C – CONTROLLER
V – The View
[Figure: a typical random image vs. actual images; the set of real images is a small subset of the set of all possible images]
Z Latent Space
• If real images are a subset of the entire space of images, then in theory we should be able to encode them with less information than the full space requires
• This smaller “space” of variables is called the latent space “z”
Autoencoder – Encoder-decoder Network
Some Examples Of Latent Spaces
Z examples
 http://vecg.cs.ucl.ac.uk/projects/projects_fonts/projects_fonts.html
 https://worldmodels.github.io/
So What Is The Use Of Z ?
• Z is a smaller space, so reduces the “state space” the model needs to deal with
• Z-values should contain more meaningful information
• Z-values, trained with enough data, should generalize to novel, unseen yet similar environments
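As a rough sketch of how V could be implemented, here is the encoder half of a small convolutional VAE in PyTorch that maps a 64x64 frame to a latent z; the layer sizes and the 32-dimensional z are illustrative assumptions, not a claim about the paper's exact architecture:

# Encode a 64x64 RGB frame into a small latent vector z (encoder half of a VAE).
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    def __init__(self, z_dim=32):                     # z_dim chosen for illustration
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.mu = nn.Linear(1024, z_dim)              # 256 channels * 2 * 2 spatial
        self.logvar = nn.Linear(1024, z_dim)

    def forward(self, x):
        h = self.conv(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterisation trick
        return z, mu, logvar

frame = torch.rand(1, 3, 64, 64)                      # a fake observation
z, mu, logvar = VAEEncoder()(frame)
print(z.shape)                                        # torch.Size([1, 32])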
M – The Model
Basics Of Gaussian Mixture Models
How To Solve Problems That Have Multimodal Solutions
[Plots: y = f(x) has a single (discrete) solution for each x, but the inverse mapping x = f(y) can have several valid answers for one y, i.e. a multimodal solution.]
Proposal: instead of predicting a single value, output a probability distribution over the state.
MDN-RNN Model (M): Predict The Future Z Value
Since the environment is stochastic, we model a distribution over the next latent rather than a single discrete value.
[Diagram: the MDN-RNN unrolled over time. At each step the RNN takes the action a_t and latent z_t, updates its hidden state h_t, and the MDN head outputs a mixture density p(z_{t+1} | a_t, z_t, h_t) over the next latent; a temperature parameter controls how random the samples are.]
MDN-RNN Model (M): Predict The Future Z Value (continued)
[Same MDN-RNN diagram as above, including the temperature parameter.]
So now we can predict the future image, e.g. prepare for the car to make a turn.
The relationship between z_t and h_t will be learned by the controller, which receives both as input.
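A rough sketch of how M could look in PyTorch follows. The sizes, the 5-component mixture, the single joint mixture over the whole z vector, and the one-step sampler are simplifications chosen for illustration, not the paper's exact implementation:

# M: an LSTM whose output parameterises a Gaussian mixture over the next latent z.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNRNN(nn.Module):
    def __init__(self, z_dim=32, a_dim=3, hidden=256, n_mix=5):   # illustrative sizes
        super().__init__()
        self.rnn = nn.LSTM(z_dim + a_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_mix * (2 * z_dim + 1))    # logit, mu, log_sigma per component
        self.z_dim, self.n_mix = z_dim, n_mix

    def forward(self, z, a, state=None):
        h, state = self.rnn(torch.cat([z, a], dim=-1), state)     # h plays the role of h_t
        logit, mu, log_sigma = torch.split(
            self.head(h), [self.n_mix, self.n_mix * self.z_dim, self.n_mix * self.z_dim], dim=-1)
        return logit, mu, log_sigma, state

def sample_z(logit, mu, log_sigma, z_dim=32, temperature=1.0):
    """Draw z_{t+1} from the mixture for a single step; higher temperature = more random dream."""
    pi = F.softmax(logit / temperature, dim=-1)                   # mixture weights
    k = torch.multinomial(pi.reshape(-1, pi.shape[-1]), 1).item() # pick a component
    mu_k = mu.reshape(-1, pi.shape[-1], z_dim)[0, k]
    sigma_k = log_sigma.reshape(-1, pi.shape[-1], z_dim)[0, k].exp() * (temperature ** 0.5)
    return mu_k + sigma_k * torch.randn(z_dim)

m = MDNRNN()
z, a = torch.randn(1, 1, 32), torch.zeros(1, 1, 3)                # one latent and one action
logit, mu, log_sigma, state = m(z, a)
z_next = sample_z(logit, mu, log_sigma, temperature=1.0)          # predicted next latent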
Notebook Example Of GMM
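The notebook itself is not reproduced here, but a typical cell might look like the following, assuming scikit-learn is available; the two-mode toy data is invented to show how a mixture captures a multimodal distribution:

# Fit a 2-component GMM to data with two modes and sample from the learned mixture.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 500),     # mode 1
                       rng.normal(2.0, 0.5, 500)])     # mode 2

gmm = GaussianMixture(n_components=2).fit(data.reshape(-1, 1))
print("weights:", gmm.weights_)                        # roughly [0.5, 0.5]
print("means:", gmm.means_.ravel())                    # roughly [-2, 2]
samples, _ = gmm.sample(5)                             # draw new points from the mixture
print(samples.ravel())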
C – The Controller
Basics Of Evolution Strategies
Evolution strategies (ES) can best be described as a gradient-descent-like method that estimates gradients from stochastic perturbations around the current parameter value.
Parameter-exploring policy gradients (PEPG)
REINFORCE-ES
Natural evolution strategies (NES): weak solutions contain information about what not to do, and this is valuable information for calculating a better estimate for the next generation
Simple ES: sample a set of solutions from a normal distribution with mean μ and a fixed standard deviation σ
Simple genetic ES: genetic algorithms help diversity by keeping track of a diverse set of candidate solutions to reproduce the next generation
OpenAI's ES variant: keep a constant σ; it does not require calculating a covariance matrix, so it needs fewer FLOPs
Covariance Matrix Adaptation ES (CMA-ES, used in this paper)
Evolution strategies offer an alternative to backpropagation, making them easier to scale across machines.
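To make the "perturb and estimate a gradient" idea concrete, here is a minimal numpy sketch of a simple / OpenAI-style ES update on a toy objective; the objective, learning rate, and population size are illustrative choices, not values from the paper:

# Simple ES: sample around the current mean with fixed sigma and move the mean
# toward the fitness-weighted average of the perturbations.
import numpy as np

def fitness(w):                                   # toy objective, maximised at w = 3
    return -np.sum((w - 3.0) ** 2)

rng = np.random.default_rng(0)
mu, sigma, pop_size, lr = np.zeros(5), 0.5, 50, 0.1

for generation in range(300):
    eps = rng.standard_normal((pop_size, mu.size))
    candidates = mu + sigma * eps                 # population sampled from N(mu, sigma^2 I)
    scores = np.array([fitness(c) for c in candidates])
    shaped = (scores - scores.mean()) / (scores.std() + 1e-8)   # fitness shaping
    mu = mu + lr / (pop_size * sigma) * (eps.T @ shaped)        # gradient-estimate update

print(np.round(mu, 2))                            # close to [3, 3, 3, 3, 3]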
Covariance Matrix Adaptation ES (CMA-ES)
The algorithm uses the results of one generation to compute the next. It adaptively increases or decreases the search space for the next generation by modulating the mean and the sigma of the parameters, which amounts to computing an entirely new covariance matrix of the parameter space at every iteration. At each generation, the algorithm provides a multivariate normal distribution to sample from.
References:
 Evolution strategies as a scalable alternative to reinforcement learning (arXiv:1703.03864)
 blog.openai.com/evolution-strategies/
 CMA-ES tutorial (arXiv:1604.00772)
Notebook Example Of CMA-ES
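The notebook is not included here; a minimal equivalent using the pycma package (an assumption: install it with pip install cma) might look like this:

# Optimise a toy objective with CMA-ES. CMA-ES minimises, so negate a reward when maximising.
import cma

def cost(w):
    return sum((x - 3.0) ** 2 for x in w)

es = cma.CMAEvolutionStrategy(5 * [0.0], 0.5)         # initial mean = zeros, initial sigma = 0.5
while not es.stop():
    solutions = es.ask()                              # sample a generation from N(mean, C)
    es.tell(solutions, [cost(s) for s in solutions])  # adapt mean, sigma, and the covariance C
print(es.result.xbest)                                # close to [3, 3, 3, 3, 3]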
Explain Covariance Matrix
[3x3 covariance matrix over parameters A, B, C, with entries cov(row, column)]
The diagonal entries, e.g. cov(B, B), tell us how much noise (variance) exists within a single parameter.
The off-diagonal entries, e.g. cov(C, A), tell us how much of the change in C is captured by change in A (the covariance between the two parameters).
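A quick numpy illustration of both readings of the matrix; the three synthetic parameters are made up for the example:

# Diagonal entries: each parameter's own variance. Off-diagonal entries: how pairs move together.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(0, 1.0, 1000)                      # independent noise
B = rng.normal(0, 2.0, 1000)                      # larger spread, so cov(B, B) is larger
C = 0.8 * A + rng.normal(0, 0.5, 1000)            # correlated with A, so cov(C, A) is nonzero

cov = np.cov(np.vstack([A, B, C]))                # 3x3 covariance matrix over A, B, C
print(np.round(cov, 2))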
Results
Experiment 1: Car Racing – Ablation
 So does this work?
 To test, the OpenAI Gym CarRacing-v0 environment was used
 The model top-scored the leaderboard
 The model was run both with and without the M component
Car Race Results
Parameter Counts For Car Race
Experiment 2 – The Advantage Of Dreams
• Note that V and M are trained completely by unsupervised learning, using a random policy
• After training, M effectively becomes a “simulation” of the real environment
• If we train a policy using M, will it work in the real environment?
So Why Use M And Not The Real Environment?
• M runs faster because:
• the dimensionality is reduced to z
• it is vectorized and therefore optimized for hardware acceleration
• M is not deterministic, and has a “temperature” parameter
Using Randomness To Reduce Gaming Of The Simulation
• Gaming a system generally relies upon exploiting “edge cases”
• Adding “randomness” to a simulation reduces the reliability of edge cases and makes the underlying simulation harder to game
• It also causes the policy to become more redundant and robust
• https://blog.openai.com/generalizing-from-simulation/
• M comes with a “randomness” slider built in! (see the short sketch after this list)
• https://worldmodels.github.io
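For intuition, the "randomness slider" is just the temperature passed to the mixture sampler; reusing the hypothetical sample_z helper from the MDN-RNN sketch earlier (both the helper and the values below are illustrative, not the paper's settings):

z_cautious = sample_z(logit, mu, log_sigma, temperature=0.5)   # sharper, more deterministic dream
z_wild = sample_z(logit, mu, log_sigma, temperature=1.5)       # noisier dream, harder for C to exploit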
Further The Discussion
Have a longer conversation on the state of AI/ML: join us for a free demo
http://training.leftbrain.consulting/
Gain hands-on experience programming from publications
Left Brain
