Info-Wasserstein-GAIL
Yunzhu Li, Jiaming Song, Stefano Ermon, “Inferring The Latent Structure of Human Decision-Making from Raw Visual Inputs”, arXiv, 2017
Sungjoon Choi
(sungjoon.choi@cpslab.snu.ac.kr)
Latent Structure of Human Demos
[Figure: expert demonstrations separated by latent code; panels: "Pass / code: 0", "Pass / code: 1", "Turn / code: 0", "Turn / code: 1"]
Contents
• Introduction
• Background
  • Generative Adversarial Imitation Learning (GAIL)
  • Policy gradient
  • InfoGAN
  • Wasserstein GAN
• InfoGAIL
• Experiments
Imitation Learning
• The goal of imitation learning is to match expert behavior.
• However, demonstrations often show significant variability due to latent factors.
• This paper presents the InfoGAIL algorithm, which can infer the latent structure of human decision making.
• The method can not only imitate, but also learn interpretable representations.
Introduction
• The goal of this paper is to develop an imitation learning framework that can autonomously discover and disentangle the latent factors of variation underlying human decision making.
• Basically, the paper combines generative adversarial imitation learning (GAIL), InfoGAN, and Wasserstein GAN, together with some reward heuristics.
GAIL
• We will NOT go into the details of GAIL.
• But we will see some basics of policy gradient methods.
Policy Gradient
• With the log-derivative (score-function) trick, the gradient of an expectation over the policy becomes an expectation of a gradient, so we can estimate it from sampled trajectories!
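In equations (the standard likelihood-ratio identity, since \nabla_\theta p_\theta(\tau) = p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)):

\nabla_\theta\, \mathbb{E}_{\tau\sim p_\theta}\big[R(\tau)\big] = \int \nabla_\theta\, p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau\sim p_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]

The gradient is now itself an expectation under p_\theta, so it can be estimated by Monte Carlo sampling of trajectories.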
Step-based PG
• In other words, we are now considering the dynamics model explicitly: the trajectory distribution factorizes into the initial-state distribution, the policy, and the transition model.
• Because the dynamics terms carry no policy parameters, they drop out of the gradient: we do NOT have to care about the complex models in an MDP anymore!
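Concretely (the standard MDP factorization):

p_\theta(\tau) = p(s_0)\prod_{t=0}^{T-1} \pi_\theta(a_t\mid s_t)\, p(s_{t+1}\mid s_t, a_t)
\;\;\Rightarrow\;\;
\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t\mid s_t)

Taking the log turns the product into a sum, and the initial-state and transition terms vanish under \nabla_\theta because they do not depend on \theta.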
Step-based PG (REINFORCE)
• Now we have the REINFORCE algorithm!
• This estimator has been used in many deep learning methods where the objective function is NOT differentiable (see the sketch below).
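A minimal runnable sketch (a toy multi-armed bandit, not from the slides; the arm means and learning rate are made up): the reward is obtained by sampling, so the objective itself is not differentiable, yet REINFORCE still applies.

```python
# REINFORCE on a toy K-armed bandit with a softmax policy.
# gradient estimate = grad log pi(a) * r  for each sampled action a.
import numpy as np

rng = np.random.default_rng(0)
K = 3
true_means = np.array([0.2, 0.5, 0.8])   # hypothetical arm rewards
theta = np.zeros(K)                      # policy logits
alpha = 0.1                              # learning rate

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(2000):
    pi = softmax(theta)
    a = rng.choice(K, p=pi)              # sample an action (non-differentiable)
    r = rng.normal(true_means[a], 0.1)   # sampled, non-differentiable reward
    grad_logp = -pi                      # d log pi(a) / d theta_k = 1{k=a} - pi_k
    grad_logp[a] += 1.0
    theta += alpha * grad_logp * r       # REINFORCE update

print("learned policy:", softmax(theta))  # mass should concentrate on the best arm
```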
Step-based PG
• Summing over all trajectories, and over all time steps within each trajectory, the policy gradient is simply a weighted MLE update, where each action's weight is the sum of its future rewards, i.e., the Q value.
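Written out, with \hat{Q} the empirical return from step t onward:

\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\big(a_t^i \mid s_t^i\big)\, \hat{Q}_t^i, \qquad \hat{Q}_t^i = \sum_{t'=t}^{T-1} r\big(s_{t'}^i, a_{t'}^i\big)

This is exactly a log-likelihood (MLE) gradient with per-sample weights.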
GAIL
• Now we know where (18) came from, right? GAIL's policy update is exactly this kind of score-function gradient, with the per-step reward supplied by the discriminator.
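For reference, the GAIL objective from Ho & Ermon (2016), which this paper builds on:

\min_{\pi}\max_{D}\; \mathbb{E}_{\pi}\big[\log D(s,a)\big] + \mathbb{E}_{\pi_E}\big[\log\big(1 - D(s,a)\big)\big] - \lambda H(\pi)

Here \pi_E is the expert policy and H(\pi) is a causal-entropy regularizer; the policy step is a policy gradient (TRPO in practice) on the discriminator-induced cost \log D(s,a).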
Visual InfoGAIL
• Interpretable Imitation Learning
  • Utilizes an information-theoretic regularization.
  • Simply adds InfoGAN to GAIL.
• Utilizing Raw Visual Inputs via Transfer Learning
  • Uses a Deep Residual Network.
InfoGAN
• Rather than using a single unstructured noise vector, InfoGAN decomposes the input noise into two parts: (1) z, incompressible noise, and (2) c, a latent code that targets the salient structured semantic features of the data distribution.
• InfoGAN adds an information-theoretic regularization: there should be high mutual information between the latent code c and the generator distribution G(z, c), i.e., I(c; G(z, c)) should be high.
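Since I(c; G(z, c)) is intractable, InfoGAN maximizes a variational lower bound using an auxiliary posterior Q(c|x):

L_I(G,Q) = \mathbb{E}_{c\sim p(c),\, x\sim G(z,c)}\big[\log Q(c\mid x)\big] + H(c) \;\le\; I\big(c;\,G(z,c)\big)

The full objective then becomes \min_{G,Q}\max_{D}\; V_{\text{GAN}}(D,G) - \lambda\, L_I(G,Q).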
Improved Optimization
• Reward Augmentation
  • A general framework for incorporating prior knowledge into imitation learning by providing additional incentives to the agent without interfering with the imitation learning process.
  • Adds a surrogate state-based reward that reflects our biases over the desired behaviors (a sketch of the combined objective follows this list).
  • Can be seen as
    • a hybrid between imitation and reinforcement learning;
    • side information provided to the generator.
• Wasserstein GAN (WGAN)
  • The discriminator network in WGAN solves a regression problem instead of a classification problem.
  • Suffers less from the vanishing-gradient and mode-collapse problems.
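As a sketch (the notation \eta and \zeta is assumed here for illustration, not quoted from the paper): with a hand-designed state-based reward \zeta(s) and \eta(\pi) = \mathbb{E}_{\pi}[\zeta(s)], the policy objective gains an extra incentive term,

\min_{\pi}\; \mathcal{L}_{\text{imitation}}(\pi) \;-\; \lambda_0\, \eta(\pi)

so the agent is rewarded for the desired behavior while the imitation term itself is left untouched.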
WGAN?
• Wasserstein Generative Adversarial Learning.
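The WGAN objective (Arjovsky et al., 2017): the critic D is a 1-Lipschitz real-valued function rather than a classifier,

\min_{G}\;\max_{\|D\|_L \le 1}\; \mathbb{E}_{x\sim p_{\text{data}}}\big[D(x)\big] - \mathbb{E}_{z\sim p(z)}\big[D(G(z))\big]

with the Lipschitz constraint enforced in the original paper by weight clipping.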
WGAN, practically
• Example 1 in the WGAN paper shows concretely why the Wasserstein distance behaves better than the JS or KL divergence (reconstructed below).
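Example 1 of the WGAN paper, reconstructed: let Z ~ U[0, 1], let P_0 be the distribution of (0, Z) and P_\theta that of (\theta, Z). Then

W(P_0, P_\theta) = |\theta|, \qquad
JS(P_0, P_\theta) = \begin{cases}\log 2 & \theta \ne 0\\ 0 & \theta = 0\end{cases}, \qquad
KL(P_\theta \,\|\, P_0) = \begin{cases}+\infty & \theta \ne 0\\ 0 & \theta = 0\end{cases}

Only the Wasserstein distance is continuous in \theta and yields a usable gradient everywhere.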
Improved Optimization
• Variance Reduction
  • Reduces the variance of the policy gradient estimator.
  • Uses a replay buffer with prioritized replay, which is good for cases where rewards are rare.
  • Uses baseline variance-reduction methods (unbiasedness shown below).
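Why subtracting a state-dependent baseline b(s) keeps the gradient unbiased (the standard argument):

\mathbb{E}_{a\sim\pi_\theta(\cdot\mid s)}\big[\nabla_\theta \log \pi_\theta(a\mid s)\, b(s)\big] = b(s)\,\nabla_\theta \sum_{a} \pi_\theta(a\mid s) = b(s)\,\nabla_\theta 1 = 0

So replacing \hat{Q} with \hat{Q} - b(s) changes only the variance of the estimator, not its expectation.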
Finally, InfoGAIL
• Initialize the policy from behavior cloning.
• Sample data, similar to InfoGAN.
• Update D, similar to WGAN.
• Update Q, similar to GAN or GAIL.
• Update the policy with TRPO.
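A minimal sketch of this loop, under loud assumptions: toy dimensions, random tensors standing in for rollouts and demonstrations, plain REINFORCE in place of TRPO, and discrete latent codes only. (The behavior-cloning initialization is noted but omitted.)

```python
# Sketch of the InfoGAIL training loop: policy G, WGAN-style critic D,
# posterior Q. All sizes and data here are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

S, A, C = 8, 2, 2   # state dim, number of actions, number of latent codes

policy = nn.Sequential(nn.Linear(S + C, 64), nn.Tanh(), nn.Linear(64, A))      # G
critic = nn.Sequential(nn.Linear(S + A, 64), nn.Tanh(), nn.Linear(64, 1))      # D
posterior = nn.Sequential(nn.Linear(S + A, 64), nn.Tanh(), nn.Linear(64, C))   # Q

# (The paper first initializes `policy` with behavior cloning; omitted here.)
opt_d = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_q = torch.optim.Adam(posterior.parameters(), lr=1e-3)
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-4)
lam = 0.1   # weight on the mutual-information reward bonus

for step in range(200):
    # 1) Sample (state, code) pairs and actions from the policy, as in InfoGAN.
    s = torch.randn(64, S)                                  # dummy states
    c = F.one_hot(torch.randint(0, C, (64,)), C).float()    # sampled codes
    logits = policy(torch.cat([s, c], dim=1))
    a_idx = torch.distributions.Categorical(logits=logits).sample()
    a = F.one_hot(a_idx, A).float()
    # Dummy expert batch (a real implementation reads demonstrations here).
    se = torch.randn(64, S)
    ae = F.one_hot(torch.randint(0, A, (64,)), A).float()

    # 2) Update D, WGAN-style: a regression critic with weight clipping.
    d_loss = critic(torch.cat([s, a], 1)).mean() - critic(torch.cat([se, ae], 1)).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    for p in critic.parameters():
        p.data.clamp_(-0.01, 0.01)

    # 3) Update Q: maximize log Q(c | s, a), the variational MI lower bound.
    q_loss = F.cross_entropy(posterior(torch.cat([s, a], 1)), c.argmax(1))
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()

    # 4) Update the policy (the paper uses TRPO; REINFORCE shown here).
    with torch.no_grad():
        logq = posterior(torch.cat([s, a], 1)).log_softmax(1)
        reward = critic(torch.cat([s, a], 1)).squeeze(1) \
               + lam * logq.gather(1, c.argmax(1, keepdim=True)).squeeze(1)
    logp = logits.log_softmax(1).gather(1, a_idx.unsqueeze(1)).squeeze(1)
    pi_loss = -(logp * reward).mean()
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
```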
Network Architectures
• Latent codes are added to G.
• Latent codes are also added to D.
• Actions are added to D.
• The posterior network Q adopts the same architecture as D, except that the output is a softmax over the discrete latent variables, or a factored Gaussian over the continuous latent variables.
[Architecture diagram]
• G (policy): input image + discrete/continuous latent codes → action
• D (cost): input image + action + discrete latent code → score
• Q (regularizer): input image + action → discrete latent code / continuous latent code
Train the policy function G with TRPO, and iterate.
Experiments
[Figure: learned rollouts separated by latent code; panels: "Pass / code: 0", "Pass / code: 1", "Turn / code: 0", "Turn / code: 1"]