Character Controllers using Motion VAEs
ACM Transactions on Graphics (TOG) (Proc. SIGGRAPH 2020)
Hung Yu Ling, Fabio Zinno, George Cheng, Michiel van de Panne
Dongmin Lee
Control & Animation Laboratory
Hanyang University
October, 2020
Outline
• Introduction
• Motion VAEs
• Motion Synthesis
• Random Walk
• Sampling-based Control
• Learning Control Policies
• Locomotion Controllers
• Target
• Joystick Control
• Path Follower
• Maze Runner
1
Introduction
2
Given example motions, how can we generalize these to produce new
purposeful motions?
Introduction
3
Given example motions, how can we generalize these to produce new
purposeful motions?
We take a two-step approach to this problem
• Kinematic generative model based on an autoregressive conditional
variational autoencoder or motion VAE (MVAE)
• A controller learned with Deep Reinforcement Learning (Deep RL) to generate
desired motions
Introduction
4
Given example motions, how can we generalize these to produce new
purposeful motions?
We take a two-step approach to this problem
• Kinematic generative model based on an autoregressive conditional
variational autoencoder or motion VAE (MVAE)
• A controller learned with Deep Reinforcement Learning (Deep RL) to generate
desired motions
What is a Variational Autoencoder (VAE)?
Introduction
5
Variational Autoencoder (VAE)
• Given a dataset $D = \{x_i\}_{i=1}^{N}$, we want a model $p_\theta(x)$ from which we can sample $x \sim p_\theta(x)$
• How? $\theta = \arg\max_\theta \log p_\theta(D)$
• Sample $z \sim p(z)$, then $z \rightarrow x$: $p_\theta(x \mid z)$ (decoder)
• $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$
[Figure: latent variable $z \sim p(z)$ is decoded into the target data $x$]
Introduction
6
Variational Autoencoder (VAE)
• $z \sim p(z \mid x) \approx q_\phi(z \mid x)$, $x \rightarrow z$: $q_\phi(z \mid x)$ (encoder)
• Variational inference: minimize $\mathrm{KL}(q_\phi(z \mid x)\,\|\,p(z \mid x))$
Introduction
7
Variational Autoencoder (VAE)
• Loss function
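For reference, the standard VAE objective (written here in its β-weighted form, which the MVAE training described later also uses) combines a reconstruction term with a KL regularizer:

$$\mathcal{L}(\theta, \phi; x) = -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + \beta\,\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$

Minimizing this loss maximizes a lower bound (the ELBO) on $\log p_\theta(x)$; $\beta = 1$ recovers the standard VAE.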
Introduction
8
Given example motions, how can we generalize these to produce new
purposeful motions?
We take a two-step approach to this problem
• Kinematic generative model based on an autoregressive conditional
variational autoencoder or motion VAE (MVAE)
• A controller learned with Deep Reinforcement Learning (Deep RL) to generate
desired motions
What is a Variational Autoencoder (VAE)?
• Deep generative model for learning latent representations
Introduction
9
Given example motions, how can we generalize these to produce new
purposeful motions?
We take a two-step approach to this problem
• Kinematic generative model based on an autoregressive conditional
variational autoencoder or motion VAE (MVAE)
• A controller learned with Deep Reinforcement Learning (Deep RL) to generate
desired motions
What is a Variational Autoencoder (VAE)?
• Deep generative model for learning latent representations
What is Deep Reinforcement Learning (Deep RL)?
• Deep RL = RL + Deep learning
• RL: sequential decision making by interacting with an environment
• Deep learning: neural-network representations of policies and value functions
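As a minimal illustration of this interaction loop (all names here are hypothetical placeholders, not an interface from the paper):

```python
# Sketch of the RL agent-environment loop: a policy network maps states to
# actions, the environment returns the next state and a reward.
def rollout(env, policy, max_steps=1000):
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                    # deep network: state -> action
        state, reward, done = env.step(action)    # environment transition
        total_reward += reward
        if done:
            break
    return total_reward
```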
Motion VAEs
10
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• An autoencoder extracts features using an encoder and a decoder
Motion VAEs
11
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• The encoder outputs a latent distribution
• The decoder input is a latent variable 𝑧 sampled from the latent distribution
Motion VAEs
12
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• Uses the previous pose as a condition
Motion VAEs
13
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• The predicted pose feeds back as input for the next step
Motion VAEs
14
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• Supervised training grounded in a motion capture database
Motion VAEs
15
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• Encoder network
• A pose $p$: $(\dot{r}_x, \dot{r}_z, \dot{r}_a, j_p, j_v, j_o)$
• $\dot{r}_x, \dot{r}_z, \dot{r}_a \in \mathbb{R}$: the character's linear and angular velocities
• $j_p, j_v \in \mathbb{R}^{3j}$: the joint positions and velocities
• $j_o \in \mathbb{R}^{6j}$: the joint orientations represented using their forward and upward vectors
• Inputs the previous pose $p_{t-1}$ and current pose $p_t$, and outputs $\mu, \sigma$
• Three-layer feed-forward neural network (256 hidden units, ELU activations)
• 32 latent dimensions
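A minimal PyTorch sketch of an encoder with this structure is given below; the layer layout, class name, and variable names are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

# Sketch: three-layer feed-forward encoder with ELU activations that maps the
# concatenated (p_{t-1}, p_t) to the mean and log-variance of a 32-D latent.
class MVAEEncoder(nn.Module):
    def __init__(self, pose_dim, latent_dim=32, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * pose_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, prev_pose, curr_pose):
        h = self.net(torch.cat([prev_pose, curr_pose], dim=-1))
        return self.mu(h), self.logvar(h)   # parameters of q(z | p_{t-1}, p_t)
```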
Motion VAEs
16
Autoregressive conditional variational autoencoder or Motion VAE (MVAE)
• Decoder network
• Mixture-of-experts (MoE) architecture
• Partitions the input space among a fixed number of expert networks
• 6 expert networks and a single gating network
• Gating network
• Decides which expert to use for each input region
• Inputs the latent variable 𝑧 and the previous pose $p_{t-1}$
• 6 expert networks
• 𝑧 is used as input to each layer to help prevent posterior collapse
• Three-layer feed-forward neural network (256 hidden units, ELU activations)
[Figures: decoder network and MoE architecture]
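A minimal PyTorch sketch of such a decoder follows. It re-injects 𝑧 at every expert layer and, for simplicity, blends the expert outputs with the gating coefficients; the original implementation may instead blend expert parameters, so treat the details as assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch: gating network maps (z, p_{t-1}) to weights over 6 experts;
# each expert is a three-layer network with z concatenated at every layer.
class MoEDecoder(nn.Module):
    def __init__(self, pose_dim, latent_dim=32, hidden=256, num_experts=6):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(latent_dim + pose_dim, hidden), nn.ELU(),
            nn.Linear(hidden, num_experts),
        )
        self.experts = nn.ModuleList([
            nn.ModuleList([
                nn.Linear(latent_dim + pose_dim, hidden),
                nn.Linear(latent_dim + hidden, hidden),
                nn.Linear(latent_dim + hidden, pose_dim),
            ]) for _ in range(num_experts)
        ])

    def run_expert(self, layers, z, prev_pose):
        h = F.elu(layers[0](torch.cat([z, prev_pose], dim=-1)))
        h = F.elu(layers[1](torch.cat([z, h], dim=-1)))   # z injected again
        return layers[2](torch.cat([z, h], dim=-1))       # predicted next pose

    def forward(self, z, prev_pose):
        w = F.softmax(self.gate(torch.cat([z, prev_pose], dim=-1)), dim=-1)
        outs = torch.stack(
            [self.run_expert(e, z, prev_pose) for e in self.experts], dim=-1)
        return (outs * w.unsqueeze(1)).sum(dim=-1)        # gated blend
```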
Motion VAEs
17
MVAE training
• Motion capture database (about 30,000 frames)
• 17 minutes of walking, running, turning, dynamic stopping, and resting motions
• Learning procedure using 𝛽-VAE (𝛽 = 0.2)
• The objective is to minimize the reconstruction and KL-divergence losses
• In a 𝛽-VAE, 𝛽 strikes a balance between reconstruction (motion quality) and
KL-divergence (motion generalization)
• The learning rate is $10^{-4}$, the mini-batch size is 64, and training takes roughly
2 hours (Nvidia GeForce GTX 1060 and Intel i7-5960X CPU)
Motion VAEs
18
MVAE training
• Stable sequence prediction
• The trained MVAE can suffer from unstable predictions: autoregressive prediction
errors can rapidly push the MVAE into new and unrecoverable regions of the
state space
• Scheduled sampling is used, which introduces a sample probability 𝑝 defined for
each training epoch (sketched below)
• 3 modes: supervised learning (𝑝 = 1, epochs = 20), scheduled sampling (decaying 𝑝,
epochs = 20), and autoregressive prediction (𝑝 = 0, epochs = 140)
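A minimal sketch of this schedule, assuming a linear decay of 𝑝 during the scheduled-sampling phase (the exact decay curve is an assumption):

```python
# Sample probability p per training epoch: 1.0 (teacher forcing), then a linear
# decay to 0.0, then fully autoregressive training for the remaining epochs.
def sample_probability(epoch, supervised=20, scheduled=20):
    if epoch < supervised:
        return 1.0
    if epoch < supervised + scheduled:
        return 1.0 - (epoch - supervised) / scheduled
    return 0.0

# During training, with probability p the decoder is conditioned on the
# ground-truth previous pose; otherwise on its own previous prediction.
```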
Motion Synthesis
19
Random walk
• Uses random samples from the MVAE latent distribution
• Our mocap database had an insufficient number of turning examples
→ a particular motion may have no transition to other motions
Random walks visualized for 6 different initial conditions (8 characters, 300 time steps)
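A minimal sketch of such a random-walk rollout, where `decoder` stands in for the trained MVAE decoder (interfaces are assumptions):

```python
import torch

# Sample a latent from the prior at every step and decode it, conditioned on
# the previously generated pose, to produce a motion sequence.
@torch.no_grad()
def random_walk(decoder, initial_pose, steps=300, latent_dim=32):
    poses = [initial_pose]
    for _ in range(steps):
        z = torch.randn(1, latent_dim)       # random sample from the prior
        poses.append(decoder(z, poses[-1]))  # autoregressive rollout
    return poses
```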
Motion Synthesis
20
Random walk
• Uses random samples from the MVAE latent distribution
• Our mocap database had an insufficient number of turning examples
→ a particular motion may have no transition to other motions
Sampling-based control
• Multiple Monte Carlo roll-outs (𝑁 = 200) for a fixed horizon (𝐻 = 4)
• Compared to policies learned with RL, sampling-based control has difficulty
directing the character to reach within two feet of the target
• For more difficult tasks (joystick, path follower), it is unable to
achieve the desired goals
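A minimal sketch of this sampling-based controller, with `decoder` and `task_cost` as hypothetical placeholders:

```python
import torch

# Roll out N candidate latent sequences of horizon H, score each rollout with a
# task cost (e.g. distance to the target), and execute only the first latent of
# the best candidate before replanning.
@torch.no_grad()
def sampling_based_step(decoder, task_cost, pose, N=200, H=4, latent_dim=32):
    best_cost, best_z = float("inf"), None
    for _ in range(N):
        zs = torch.randn(H, 1, latent_dim)
        p, cost = pose, 0.0
        for h in range(H):
            p = decoder(zs[h], p)
            cost += task_cost(p)
        if cost < best_cost:
            best_cost, best_z = cost, zs[0]
    return decoder(best_z, pose)
```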
Learning Control Policies
21
Deep RL
• Note that the latent variable 𝑧 is treated as the action space
• The proximal policy optimization (PPO) algorithm is used as the Deep RL algorithm
• Control network
• Two hidden-layer neural network, 256 hidden units, ReLU activations
• Output layer: Tanh activation, scaled to the range [-4, 4]
• The policy and value networks are updated in mini-batches of 1000 samples
• All tasks can be trained within 1 to 6 hours
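A minimal PyTorch sketch of a control network with this shape; it shows only the deterministic output (the Gaussian exploration noise that PPO adds is omitted), and all names are illustrative:

```python
import torch
import torch.nn as nn

# Two hidden layers of 256 ReLU units; the output is squashed by Tanh and
# scaled to [-4, 4] to form the latent action z passed to the MVAE decoder.
class ControlPolicy(nn.Module):
    def __init__(self, obs_dim, latent_dim=32, hidden=256, scale=4.0):
        super().__init__()
        self.scale = scale
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim), nn.Tanh(),
        )

    def forward(self, obs):
        return self.scale * self.net(obs)    # latent action in [-4, 4]
```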
Locomotion Controllers
22
Locomotion tasks: target, joystick control, path follower, maze runner
• Target
• The goal is to navigate towards a target
• The character reaches the target if its pelvis is within two feet of the target
• The task environment is 120×80 feet
• The reward 𝑟(𝑠, 𝑎) is based on the distance between the root and the target, plus a
bonus reward for reaching the target
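A minimal sketch of a reward with this structure; the exact distance shaping and bonus value are assumptions:

```python
import math

# Negative root-to-target distance, plus a bonus once the pelvis is within
# two feet of the target.
def target_reward(root_xy, target_xy, bonus=100.0, reach_radius=2.0):
    dist = math.dist(root_xy, target_xy)
    reward = -dist
    if dist < reach_radius:
        reward += bonus
    return reward
```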
Locomotion Controllers
23
Locomotion tasks: target, joystick control, path follower, maze runner
• Joystick control
• The goal is to change the character’s heading direction and speed to match the
direction of the joystick
• The desired direction is uniformly sampled between 0 and 2𝜋 every 120 frames
• The desired speed is uniformly selected between 0 and 24 feet per second every
240 frames.
• The reward is the product of a direction-matching term and a speed-matching term:
$r_{\text{joystick}} = e^{\cos(\theta - \theta^{*}) - 1} \times e^{-|\dot{r} - \dot{r}^{*}|}$,
where $\theta, \theta^{*}$ are the current and desired heading directions and
$\dot{r}, \dot{r}^{*}$ are the current and desired speeds
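A minimal sketch of a reward with the reconstructed form above; only the overall structure (an exponential direction-matching term times an exponential speed-matching term) is taken from the slide, so treat the exact expression as an assumption:

```python
import math

# Direction term peaks at 1 when the heading matches the desired direction;
# speed term peaks at 1 when the speed matches the desired speed.
def joystick_reward(heading, desired_heading, speed, desired_speed):
    direction_term = math.exp(math.cos(heading - desired_heading) - 1.0)
    speed_term = math.exp(-abs(speed - desired_speed))
    return direction_term * speed_term
```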
Locomotion Controllers
24
Locomotion tasks: target, joystick control, path follower, maze runner
• Path follower (extension of the target task)
• The goal is to follow a predefined 2D path
• The character sees multiple targets (𝑁 = 4), each spaced 15 time steps apart
• The path (left figure) is a parametric curve given by $x = A\sin(bt)$,
$y = A\sin(bt)\cos(bt)$, where $t \in [0, 2\pi]$, $A = 50$, $b = 2$
• The parameter $t$ is discretized into 1200 equal steps
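A minimal sketch that generates this path and the targets handed to the character:

```python
import math

# x = A*sin(b*t), y = A*sin(b*t)*cos(b*t) with t in [0, 2*pi] discretized
# into 1200 equal steps (A = 50, b = 2).
def figure_eight_path(A=50.0, b=2.0, num_steps=1200):
    points = []
    for i in range(num_steps):
        t = 2.0 * math.pi * i / num_steps
        x = A * math.sin(b * t)
        y = A * math.sin(b * t) * math.cos(b * t)
        points.append((x, y))
    return points

# The character sees the next N = 4 path points, spaced 15 time steps apart.
```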
Locomotion Controllers
25
Locomotion tasks: target, joystick control, path follower, maze runner
• Maze runner
• The goal is to explore the maze without collision
• The maze is fully enclosed without an entrance or exit
• The maze size is a square of 160×160 feet and the total allotted time is 1500 steps
• The maze is divided into 32×32 equal sectors to define an exploration reward
• The episode terminates when the character hits any of the walls or the allotted
1500 time steps are exhausted
• A vision system that casts 16 light rays is used to navigate the environment
• Hierarchical RL is used to solve this task, with a high-level controller (HLC) and a
low-level controller (LLC), similar to DeepLoco
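A minimal sketch of a sector-based exploration reward with these dimensions; the per-sector bonus value is an assumption:

```python
# Divide the 160x160-foot maze into a 32x32 grid and reward the character the
# first time it enters each sector.
class ExplorationReward:
    def __init__(self, maze_size=160.0, grid=32, bonus=1.0):
        self.cell = maze_size / grid
        self.bonus = bonus
        self.visited = set()

    def __call__(self, x, y):
        sector = (int(x // self.cell), int(y // self.cell))
        if sector in self.visited:
            return 0.0
        self.visited.add(sector)
        return self.bonus
```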
Thank You!
Any Questions?
