When does label smoothing help?
Müller, Rafael, Simon Kornblith, and Geoffrey Hinton
Student reporter: Jane Hsing-Chuan Hsieh
Advisor: Henry Horng-Shing Lu
Date: 2021/10/21
Outline
1. Research Context
2. When does label smoothing help?
• Mathematical description of label smoothing
• Penultimate layer representations
• Properties
1. Research Context
Research context
• Introduction:
1. Label smoothing (LS) has been used successfully to improve the accuracy of neural networks (NNs) by computing the cross-entropy with soft targets rather than hard targets
• E.g., image classification, speech recognition, and machine translation
• Definition of label smoothing (LS):
1. LS uses soft targets that are a weighted average of the hard targets and the uniform distribution over labels
• E.g., for hard label y = [1, 0, 0] → soft label y^LS = [1 − 0.1 + 0.1/3, 0.1/3, 0.1/3] = [2.8/3, 0.1/3, 0.1/3]
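A minimal sketch of this mixing step (numpy; the helper name `smooth_labels` is my own, not from the paper):

```python
import numpy as np

def smooth_labels(y_hard, alpha=0.1):
    """Soft targets: a weighted average of the hard targets and the uniform distribution over K labels."""
    K = y_hard.shape[-1]
    return y_hard * (1.0 - alpha) + alpha / K

y = np.array([1.0, 0.0, 0.0])    # hard label for a 3-class problem
print(smooth_labels(y))           # [0.9333, 0.0333, 0.0333] = [2.8/3, 0.1/3, 0.1/3]
```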
Source:
Müller, Rafael, Simon Kornblith, and Geoffrey Hinton. "When does label smoothing help?." arXiv preprint arXiv:1906.02629 (2019).
Research context
• Concern:
Despite its widespread use, not much is known about why label smoothing should work.
• Purpose:
This paper tries to shed light on the behavior of NNs trained with label smoothing.
2. When does label smoothing help?
Source:
Müller, Rafael, Simon Kornblith, and Geoffrey Hinton. "When does label smoothing help?." arXiv preprint arXiv:1906.02629 (2019).
Mathematical description of label smoothing
• Softmax function: write the prediction of the NN as a function of the activations in the penultimate layer (i.e., 𝒙):

𝑝𝑘 = exp(𝒙ᵀ𝒘𝑘) / Σ_𝑙 exp(𝒙ᵀ𝒘𝑙)   (1)

𝑝𝑘: the likelihood the model assigns to the k-th class
𝒘𝑘: the weights (including biases) of the last layer
[Figure: the penultimate layer (e.g., a fully connected layer) produces activations 𝒙; the last layer computes the logits 𝒙ᵀ𝒘₁, 𝒙ᵀ𝒘₂, 𝒙ᵀ𝒘₃, which the softmax maps to 𝑝]
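A minimal numpy sketch of equation (1); the function name and the example numbers are my own illustration, not from the paper:

```python
import numpy as np

def softmax_from_penultimate(x, W):
    """Equation (1): p_k = exp(x^T w_k) / sum_l exp(x^T w_l).

    x: activations of the penultimate layer, shape (d,)
    W: last-layer weights, one template w_k per class, shape (K, d)
    """
    logits = W @ x                  # x^T w_k for every class k
    logits = logits - logits.max()  # subtract a constant for numerical stability; p is unchanged
    exps = np.exp(logits)
    return exps / exps.sum()

# Example: 3 classes, 4 penultimate activations (made-up numbers).
W = np.array([[ 0.2, -0.1,  0.4, 0.0],
              [ 0.1,  0.3, -0.2, 0.5],
              [-0.3,  0.2,  0.1, 0.1]])
x = np.array([1.0, 0.5, -0.5, 2.0])
print(softmax_from_penultimate(x, W))   # probabilities over the 3 classes, summing to 1
```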
Mathematical description of label smoothing
• Cross-entropy: we minimize the expected value of the cross-entropy between the targets 𝑦𝑘 and the network's outputs 𝑝𝑘, H(y, p) = Σ_k −𝑦𝑘 log(𝑝𝑘):
1. For a network trained with hard targets: 𝑦𝑘 is the one-hot encoding, i.e., 𝑦𝑘 is "1" for the correct class and "0" for the rest.
2. For a network trained with soft targets: replace 𝑦𝑘 with 𝑦𝑘^LS = 𝑦𝑘(1 − α) + α/K, where α is the label smoothing parameter.
• E.g., if α = 0.1, K = 3: y₁ = [1, 0, 0] → y₁^LS = [1 × 0.9 + 0.1/3, 0.1/3, 0.1/3] = [2.8/3, 0.1/3, 0.1/3]
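A minimal sketch of the two training losses (numpy; the model output p below is made up for illustration):

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """H(y, p) = -sum_k y_k * log(p_k)."""
    return -np.sum(y * np.log(p + eps))

alpha, K = 0.1, 3
y_hard = np.array([1.0, 0.0, 0.0])          # one-hot target
y_soft = y_hard * (1 - alpha) + alpha / K   # y^LS = [2.8/3, 0.1/3, 0.1/3]

p = np.array([0.7, 0.2, 0.1])               # hypothetical network output
print(cross_entropy(y_hard, p))             # loss against hard targets
print(cross_entropy(y_soft, p))             # loss against smoothed targets
```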
Penultimate layer representations
• Observation: substituting (2) into (1), we derive (3):

‖𝒙 − 𝒘𝑘‖² = 𝒙ᵀ𝒙 − 2𝒙ᵀ𝒘𝑘 + 𝒘𝑘ᵀ𝒘𝑘 → 𝒙ᵀ𝒘𝑘 = ½(𝒙ᵀ𝒙 + 𝒘𝑘ᵀ𝒘𝑘 − ‖𝒙 − 𝒘𝑘‖²)   (2)

𝑝𝑘 = exp(−½‖𝒙 − 𝒘𝑘‖²) / Σ_𝑙 exp(−½‖𝒙 − 𝒘𝑙‖²)   (3)

Note: 𝒙ᵀ𝒙 is factored out and 𝒘𝑘ᵀ𝒘𝑘 is usually constant across classes in (3).

1. If class(𝒙) = k: label smoothing encourages 𝒙 (the activations of the penultimate layer) to be close to 𝒘𝑘 (called the 'template' of the correct class).
2. For the incorrect classes, ∀𝑙 ≠ k: label smoothing encourages 𝒙 to be equally distant (with the distance depending on α) from the templates of the incorrect classes (i.e., the 𝒘𝑙's); that is, ‖𝒙 − 𝒘𝑙‖ = ‖𝒙 − 𝒘𝑙′‖, ∀𝑙, 𝑙′ ≠ k.
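A small numerical check (my own illustration, not from the paper) of the step from (1) to (3): when every template 𝒘𝑘 has the same norm, the softmax of the logits equals the softmax of −½ the squared distances, because the 𝒙ᵀ𝒙 and 𝒘𝑘ᵀ𝒘𝑘 terms are constant across classes and cancel:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 8

# Class templates w_k, rescaled so that ||w_k|| is identical for every class
# (the "usually constant across classes" assumption in the note above).
W = rng.normal(size=(K, d))
W = W / np.linalg.norm(W, axis=1, keepdims=True)

x = rng.normal(size=d)                  # penultimate-layer activations

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

p_from_logits = softmax(W @ x)                    # equation (1): softmax of x^T w_k
sq_dists = np.sum((x - W) ** 2, axis=1)           # ||x - w_k||^2 for every class
p_from_dists = softmax(-0.5 * sq_dists)           # equation (3)

print(np.allclose(p_from_logits, p_from_dists))   # True: (1) and (3) agree here
```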
Properties
• We visualize penultimate-layer representations of 3-class image classifiers trained on different datasets (rows).
• Observation 1:
1. Training w/o LS: projections are spread into defined but broad clusters.
2. Training w/ LS: the clusters are much tighter, because LS encourages each example in the training set to be equidistant from the templates of all the other classes.
[Figure: penultimate-layer visualizations w/o vs. w/ LS; annotations: same acc., better acc.]
Properties
• Observation 2:
• W/ LS, the difference between the logits (i.e., 𝒙ᵀ𝒘) of two classes has to be limited in absolute value to produce the desired soft targets for the correct and incorrect classes.
• Note: higher absolute values represent over-confident predictions; LS prevents the network from becoming over-confident.
[Figure: penultimate-layer visualizations w/o vs. w/ LS; annotations: same acc., better acc.]
Properties
• Observation 3:
• Validation w/ LS: despite trying to collapse the training examples into tiny clusters, these NNs still generalize and are well calibrated; on the validation set each cluster spreads towards the center, representing the full range of confidences for each prediction.
[Figure: penultimate-layer visualizations w/o vs. w/ LS; annotations: same acc., better acc.]
Summary
1. LS encourages representations in the penultimate layer to group in tight, equally distant clusters.
2. LS has a positive effect on model generalization and calibration.
3. Shortcoming:
• LS reduces mutual information (the cross-sample information that differentiates examples from one another), which especially hurts "knowledge distillation"
• i.e., distilling the teacher network's knowledge into a student network