When does label smoothing help?
Müller, Rafael, Simon Kornblith, and Geoffrey Hinton
Student reporter: Jane Hsing-Chuan Hsieh
Advisor: Henry Horng-Shing Lu
Date: 2021/10/21
Outline
1. Research Context
2. When does label smoothing help?
• Mathematical description of label smoothing
• Penultimate layer representations
• Properties
1. Research Context
Research context
• Introduction:
1. Label smoothing (LS) has been used successfully to improve the accuracy of neural networks (NNs) by computing the cross-entropy with soft targets rather than hard targets
• E.g., image classification, speech recognition, and machine translation
• Definition of label smoothing (LS):
1. LS uses soft targets that are a weighted average of the hard targets and the uniform distribution over labels
• E.g., for hard label y = [1, 0, 0] → soft label y^LS = [1 − 0.1 + 0.1/3, 0.1/3, 0.1/3] = [2.8/3, 0.1/3, 0.1/3]
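A minimal sketch of this mixing step (numpy; the helper name `smooth_labels` is my own, not from the paper):

```python
import numpy as np

def smooth_labels(y_hard, alpha=0.1):
    """Soft targets: a weighted average of the hard targets and the uniform distribution over K labels."""
    K = y_hard.shape[-1]
    return y_hard * (1.0 - alpha) + alpha / K

y = np.array([1.0, 0.0, 0.0])    # hard label for a 3-class problem
print(smooth_labels(y))           # [0.9333, 0.0333, 0.0333] = [2.8/3, 0.1/3, 0.1/3]
```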
Source:
Müller, Rafael, Simon Kornblith, and Geoffrey Hinton. "When does label smoothing help?." arXiv preprint arXiv:1906.02629 (2019).
Research context
• Concern:
Despite its widespread use, not much is known about why label smoothing should work.
• Purpose:
This paper tries to shed light on the behavior of NNs trained with label smoothing.
2. When does label smoothing help?
Source:
Müller, Rafael, Simon Kornblith, and Geoffrey Hinton. "When does label smoothing help?." arXiv preprint arXiv:1906.02629 (2019).
Mathematical description of label smoothing
• Softmax function: write the prediction of the NN as a function of the activations in the penultimate layer (i.e., 𝒙):

𝑝𝑘 = exp(𝒙ᵀ𝒘𝑘) / Σ_𝑙 exp(𝒙ᵀ𝒘𝑙)   (1)

𝑝𝑘: the likelihood the model assigns to the k-th class
𝒘𝑘: the weights (including biases) of the last layer
[Figure: the penultimate layer (e.g., a fully connected layer) produces activations 𝒙; the last layer computes the logits 𝒙ᵀ𝒘₁, 𝒙ᵀ𝒘₂, 𝒙ᵀ𝒘₃, which the softmax maps to 𝑝]
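A minimal numpy sketch of equation (1); the function name and the example numbers are my own illustration, not from the paper:

```python
import numpy as np

def softmax_from_penultimate(x, W):
    """Equation (1): p_k = exp(x^T w_k) / sum_l exp(x^T w_l).

    x: activations of the penultimate layer, shape (d,)
    W: last-layer weights, one template w_k per class, shape (K, d)
    """
    logits = W @ x                  # x^T w_k for every class k
    logits = logits - logits.max()  # subtract a constant for numerical stability; p is unchanged
    exps = np.exp(logits)
    return exps / exps.sum()

# Example: 3 classes, 4 penultimate activations (made-up numbers).
W = np.array([[ 0.2, -0.1,  0.4, 0.0],
              [ 0.1,  0.3, -0.2, 0.5],
              [-0.3,  0.2,  0.1, 0.1]])
x = np.array([1.0, 0.5, -0.5, 2.0])
print(softmax_from_penultimate(x, W))   # probabilities over the 3 classes, summing to 1
```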
Mathematical description of label smoothing
• Cross-entropy: we minimize the expected value of the cross-entropy between the targets 𝑦𝑘 and the network's outputs 𝑝𝑘, H(y, p) = Σ_k −𝑦𝑘 log(𝑝𝑘):
1. For a network trained with hard targets: 𝑦𝑘 is the one-hot encoding, i.e., 𝑦𝑘 is "1" for the correct class and "0" for the rest.
2. For a network trained with soft targets: replace 𝑦𝑘 with 𝑦𝑘^LS = 𝑦𝑘(1 − α) + α/K, where α is the label smoothing parameter.
• E.g., if α = 0.1, K = 3: y₁ = [1, 0, 0] → y₁^LS = [1 × 0.9 + 0.1/3, 0.1/3, 0.1/3] = [2.8/3, 0.1/3, 0.1/3]
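A minimal sketch of the two training losses (numpy; the model output p below is made up for illustration):

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """H(y, p) = -sum_k y_k * log(p_k)."""
    return -np.sum(y * np.log(p + eps))

alpha, K = 0.1, 3
y_hard = np.array([1.0, 0.0, 0.0])          # one-hot target
y_soft = y_hard * (1 - alpha) + alpha / K   # y^LS = [2.8/3, 0.1/3, 0.1/3]

p = np.array([0.7, 0.2, 0.1])               # hypothetical network output
print(cross_entropy(y_hard, p))             # loss against hard targets
print(cross_entropy(y_soft, p))             # loss against smoothed targets
```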
Penultimate layer representations
• Observation: substituting (2) into (1), we derive (3):

‖𝒙 − 𝒘𝑘‖² = 𝒙ᵀ𝒙 − 2𝒙ᵀ𝒘𝑘 + 𝒘𝑘ᵀ𝒘𝑘 → 𝒙ᵀ𝒘𝑘 = ½(𝒙ᵀ𝒙 + 𝒘𝑘ᵀ𝒘𝑘 − ‖𝒙 − 𝒘𝑘‖²)   (2)

𝑝𝑘 = exp(−½‖𝒙 − 𝒘𝑘‖²) / Σ_𝑙 exp(−½‖𝒙 − 𝒘𝑙‖²)   (3)

Note: 𝒙ᵀ𝒙 is factored out and 𝒘𝑘ᵀ𝒘𝑘 is usually constant across classes in (3).

1. If class(𝒙) = k: label smoothing encourages 𝒙 (the activations of the penultimate layer) to be close to 𝒘𝑘 (called the 'template' of the correct class).
2. For the incorrect classes, ∀𝑙 ≠ k: label smoothing encourages 𝒙 to be equally distant (with the distance depending on α) from the templates of the incorrect classes (i.e., the 𝒘𝑙's); that is, ‖𝒙 − 𝒘𝑙‖ = ‖𝒙 − 𝒘𝑙′‖, ∀𝑙, 𝑙′ ≠ k.
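A small numerical check (my own illustration, not from the paper) of the step from (1) to (3): when every template 𝒘𝑘 has the same norm, the softmax of the logits equals the softmax of −½ the squared distances, because the 𝒙ᵀ𝒙 and 𝒘𝑘ᵀ𝒘𝑘 terms are constant across classes and cancel:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 8

# Class templates w_k, rescaled so that ||w_k|| is identical for every class
# (the "usually constant across classes" assumption in the note above).
W = rng.normal(size=(K, d))
W = W / np.linalg.norm(W, axis=1, keepdims=True)

x = rng.normal(size=d)                  # penultimate-layer activations

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

p_from_logits = softmax(W @ x)                    # equation (1): softmax of x^T w_k
sq_dists = np.sum((x - W) ** 2, axis=1)           # ||x - w_k||^2 for every class
p_from_dists = softmax(-0.5 * sq_dists)           # equation (3)

print(np.allclose(p_from_logits, p_from_dists))   # True: (1) and (3) agree here
```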
Properties
• We visualize penultimate-layer representations of 3-class image classifiers trained on different datasets (rows).
• Observation 1:
1. Training w/o LS: projections are spread into defined but broad clusters.
2. Training w/ LS: the clusters are much tighter, because LS encourages each example in the training set to be equidistant from the templates of all the other classes.
[Figure: penultimate-layer visualizations w/o vs. w/ LS; annotations: same acc., better acc.]
Properties
• Observation 2:
• W/ LS, the difference between the logits (i.e., 𝒙ᵀ𝒘) of two classes has to be limited in absolute value to produce the desired soft targets for the correct and incorrect classes.
• Note: higher absolute values represent over-confident predictions; LS prevents the network from becoming over-confident.
[Figure: penultimate-layer visualizations w/o vs. w/ LS; annotations: same acc., better acc.]
Properties
• Observation 3:
• Validation w/ LS: despite trying to collapse the training examples into tiny clusters, these NNs still generalize and are well calibrated; on the validation set each cluster spreads towards the center, representing the full range of confidences for each prediction.
[Figure: penultimate-layer visualizations w/o vs. w/ LS; annotations: same acc., better acc.]
Summary
1. LS encourages representations in the penultimate layer to group in tight, equally distant clusters.
2. LS has a positive effect on model generalization and calibration.
3. Shortcoming:
• LS reduces mutual information (the cross-sample information that differentiates examples from one another), which especially hurts "knowledge distillation"
• i.e., distilling the teacher network's knowledge into a student network