Meta Dropout:
Learning to Perturb Latent Features for Generalization
Hae Beom Lee¹, Taewook Nam¹, Eunho Yang¹², Sung Ju Hwang¹²
KAIST¹, AITRICS²
Few-shot Learning
Humans can generalize even with a single observation of a class.
[Lake et al. 11] One shot Learning of Simple Visual Concepts, CogSci 2011
[Figure: a single observation of a new class, and the query examples a human can still classify correctly.]
Few-shot Learning
On the other hand, deep neural networks require a large number of training instances
to generalize well, and overfit when only a few training instances are available.
[Figure: the same observation and query examples, now given to a deep neural network in the few-shot learning setting.]
How can we learn a model that generalizes well even with few training instances?
Learning to Perturb Latent Features
Lack of data results in poor estimation of the decision boundary.
[Figure: a decision boundary estimated from a few training examples, together with the test examples; each training example is surrounded by an input-dependent noise distribution 𝑝_𝜙(𝒛|𝒙).]
What if we learn to perturb the latent features in order to explain the test examples?
→ But the test examples are not observable in the standard learning framework.
Then how can we learn 𝝓?
Meta-Learning for Few-shot Classification
Meta-learning: learn a model that can generalize over a task distribution!
[Figure: meta-training tasks and a meta-test task, each split into a training set and a test set; the meta-knowledge 𝑝_𝜙(𝒛|𝒙) learned during meta-training is transferred to the meta-test task.]
[Ravi and Larochelle 17] Optimization as a Model for Few-shot Learning, ICLR 2017
Model-Agnostic Meta-Learning (MAML)
Model-Agnostic Meta-Learning (MAML) aims to find a good initial model parameter that can rapidly adapt to any task with only a few gradient steps.
[Finn et al. 17] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017
[Figure: from the initial model parameter, the task gradients ∇ℒ1, ∇ℒ2, ∇ℒ3 lead to the task-specific parameters.]
[Figure: the same meta-learned initialization also adapts to a novel task; a few gradient steps along ∇ℒ∗ yield its task-specific parameter, just as ∇ℒ1, ∇ℒ2, ∇ℒ3 did for the meta-training tasks.]
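To make this concrete, the MAML bi-level optimization can be sketched as follows (a standard formulation in our own notation, not copied from the slides; it assumes a single inner-gradient step of size α per task τ):

\theta_\tau = \theta - \alpha \, \nabla_\theta \, \mathcal{L}^{\mathrm{train}}_\tau(\theta)   % inner step: adapt the shared initialization to task \tau
\min_\theta \ \sum_\tau \mathcal{L}^{\mathrm{test}}_\tau(\theta_\tau)                          % outer step: meta-update \theta so that the adapted parameters generalize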
Model-Agnostic Meta-Learning (MAML)
[Figure: from the initial parameter 𝜃, the inner gradient ∇_𝜃 ℒtrain on the few training examples leads to the task-adapted parameter 𝜃*_MAML and its decision boundary.]
Except for sharing the initial model parameter 𝜃, the MAML inner gradient ∇_𝜃 ℒtrain does not involve any knowledge beyond 𝐷train, which may result in suboptimal decision boundaries at the end of task adaptation.
MAML + Meta Dropout
[Figure: each training example is now perturbed with the input-dependent noise 𝑝_𝜙(𝒛|𝒙); the inner gradient of the expected loss, ∇_𝜃 𝔼_{𝑝_𝜙(𝒁|𝑿)}[ℒtrain], leads to the task-adapted parameter 𝜃*_MetaDrop instead of 𝜃*_MAML.]
In Meta-dropout, we introduce an input-dependent noise distribution and compute the gradient of the expected loss over that noise distribution. This improves the final decision boundary of the model at the end of task adaptation.
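As a rough illustration of how the gradient of the expected loss can be computed in practice, the sketch below approximates the expectation over 𝑝_𝜙(𝒛|𝒙) with a few Monte Carlo samples. The interfaces (main_net taking a noise argument, noise_net, n_samples) are illustrative assumptions rather than the authors' implementation, and in the actual architecture the noise is generated layer-wise rather than in a single call.

import torch
import torch.nn.functional as F

def meta_dropout_inner_step(main_net, noise_net, x_train, y_train,
                            inner_lr=0.1, n_samples=4):
    # Monte Carlo estimate of the expected training loss over p_phi(z|x):
    # sample the input-dependent noise a few times and average the loss.
    losses = []
    for _ in range(n_samples):
        z = noise_net(x_train)                   # one sample z ~ p_phi(z|x)
        logits = main_net(x_train, noise=z)      # forward pass with perturbed features
        losses.append(F.cross_entropy(logits, y_train))
    expected_loss = torch.stack(losses).mean()

    # Gradient of the expected loss w.r.t. the shared initialization theta.
    theta = list(main_net.parameters())
    grads = torch.autograd.grad(expected_loss, theta, create_graph=True)

    # Single inner-gradient step: theta_tau = theta - alpha * grad.
    theta_tau = [p - inner_lr * g for p, g in zip(theta, grads)]
    return theta_tau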
Model Architecture
[Figure: the main model is a 4-conv net; alongside each Conv and FC layer, a noise branch with parameters 𝜙 generates the multiplicative noise 𝒛 applied to the features before the inner step ∇_𝜃 𝔼_{𝑝_𝜙(𝒁|𝑿)}[ℒtrain].]
We let each lower layer generate the noise for the layer above it, and the multiplicative noise takes the form of a softplus transformation of a Gaussian distribution.
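A minimal sketch of what such a noise layer could look like (the module name, the 1x1-conv parameterization of the Gaussian mean, and the unit-variance assumption are our illustrative choices, not the authors' exact implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiplicativeSoftplusNoise(nn.Module):
    # Input-dependent multiplicative noise: z = softplus(mu(h) + eps), eps ~ N(0, I),
    # generated from the lower-layer features h and multiplied onto them.
    def __init__(self, channels):
        super().__init__()
        self.mu = nn.Conv2d(channels, channels, kernel_size=1)  # noise parameters phi

    def forward(self, h, sample=True):
        eps = torch.randn_like(h) if sample else torch.zeros_like(h)
        z = F.softplus(self.mu(h) + eps)   # softplus-transformed Gaussian sample
        return h * z                       # multiplicative perturbation of the features

Setting sample=False gives a deterministic pass, which is one way to realize the "no perturbation" evaluation mentioned on the next slide.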
Learning Objective
Meta-learning → maximize the performance on the test examples of each task.
[Equation: the test log-likelihood is evaluated with no perturbation, while the inner-gradient step marginalizes over the noise on the training examples.]
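Written out, the objective can be sketched as below (our reconstruction from the slide annotations, not verbatim from the paper; it assumes a single inner step of size α per task τ):

\max_{\theta,\,\phi}\ \sum_\tau \log p\big(Y^{\mathrm{test}}_\tau \mid X^{\mathrm{test}}_\tau;\ \theta_\tau\big)   % test log-likelihood, evaluated with no perturbation
\text{where}\quad \theta_\tau = \theta - \alpha\,\nabla_\theta\, \mathbb{E}_{p_\phi(Z \mid X^{\mathrm{train}}_\tau)}\big[-\log p\big(Y^{\mathrm{train}}_\tau \mid Z;\ \theta\big)\big]   % inner-gradient step, marginalizing over the noise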
Generalization Performance
[Figure: decision boundaries adapted from two training examples (Train 1, Train 2) and the corresponding test examples, for MAML and Meta-dropout.]

miniImageNet 5-way   MAML     Meta-dropout
1-shot               49.58%   51.93%
5-shot               64.55%   67.42%
Visualization of Stochastic Features
[Figure: original images alongside two stochastic feature channels (Stochastic Channel 1 and Stochastic Channel 2).]
Comparison against Existing Regularizers

Models                               Omniglot 1-shot   Omniglot 5-shot   miniImageNet 1-shot   miniImageNet 5-shot
No Perturbation                      95.23             98.38             49.58                 64.55
Manifold Mixup                       89.78             97.86             48.62                 63.86
Variational Information Bottleneck   94.98             98.85             48.12                 64.78
Information Dropout                  94.49             98.65             50.36                 65.91
Meta-dropout                         96.63             99.04             51.93                 67.42

Meta-dropout outperforms existing regularizers such as Manifold Mixup and the information-theoretic regularizers.
Adversarial Robustness
Meta-dropout improves both clean and adversarial accuracies.
[Figure: clean and adversarial accuracies of MAML and Meta-dropout under an 𝐿∞-norm attack, Omniglot 20-way 1-shot.]
Adversarial Robustness
The defense of Meta-dropout also generalizes across different attacks.
[Figure: accuracies under 𝐿∞-, 𝐿1-, and 𝐿2-norm attacks, Omniglot 20-way 1-shot.]
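For context, an 𝐿∞-norm attack perturbs every input dimension by at most ε. The sketch below shows the single-step FGSM variant purely as an illustration; the specific attack methods and budgets evaluated in the experiments may differ.

import torch
import torch.nn.functional as F

def fgsm_linf_attack(model, x, y, epsilon=0.1):
    # Single-step L-infinity-bounded attack: move each pixel by +/- epsilon
    # in the direction that increases the classification loss.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0)
    return x_adv.detach()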
Summary
• In this work, we showed that we can learn to perturb latent features in an input-dependent manner in order to improve generalization.
• The meta-learning framework enables effective learning of the perturbation function.
• Meta-dropout outperforms existing regularizers in the meta-learning setting.
• Meta-dropout improves both clean and adversarial accuracies under various types of attacks.
