2. Problem
Many large-scale datasets are collected
from websites; however, they tend to contain
inaccurate labels, which are termed noisy
labels.
[Example image of a horse]
Noisy label : dog
Clean label : horse
3. Goal
A joint optimization framework that learns the
DNN parameters and estimates the true labels.
A usual image classification network is then
trained on these estimated labels.
4. Label
Hard-label space H = {y : y ∈ {0, 1}^c, 1^T y = 1}
Ex : y^T = [0, 1, 0] with c = 3
Soft-label space S = {y : y ∈ [0, 1]^c, 1^T y = 1}
Ex : y^T = [0.2, 0.7, 0.1] with c = 3
Parameters
c : number of classes
y : label (column vector)
1 : column vector of all ones
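As a quick illustration (a minimal sketch with made-up values), the two label spaces differ only in whether the entries are binary:

```python
import numpy as np

c = 3  # number of classes

# Hard label: entries are 0 or 1 and they sum to 1 (a one-hot vector).
y_hard = np.array([0.0, 1.0, 0.0])

# Soft label: entries lie in [0, 1] and they sum to 1.
y_soft = np.array([0.2, 0.7, 0.1])

assert y_hard.sum() == 1.0
assert np.isclose(y_soft.sum(), 1.0) and ((0 <= y_soft) & (y_soft <= 1)).all()
```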
6. The concept of joint optimization
framework
Algorithm 1 Alternating Optimization
for t ← 1 to num_epochs do
  update θ(t+1) by SGD on L(θ(t), Y(t) | X)
  update Y(t+1) by the hard-label or soft-label update rule
end for
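As a rough illustration, here is a minimal PyTorch-style sketch of Algorithm 1. The model, optimizer, loader, and the soft-label update shown here are assumptions on my part; the paper also gives a hard-label rule, and in practice it averages predictions over past epochs.

```python
import torch
import torch.nn.functional as F

def alternating_optimization(model, optimizer, loader, labels, num_epochs):
    """Sketch of Algorithm 1: alternate updates of theta and of the labels Y.

    labels: (n, c) tensor of current soft labels, indexed by sample id.
    loader: assumed to yield (images, sample_indices) mini-batches.
    """
    for epoch in range(num_epochs):
        for images, idx in loader:
            # Update theta^(t+1) by SGD on L(theta^(t), Y^(t) | X).
            optimizer.zero_grad()
            logits = model(images)
            # Only the L_c part of the loss is shown here; the full loss
            # adds alpha * L_p + beta * L_e (see the later terms).
            loss = F.kl_div(F.log_softmax(logits, dim=1),
                            labels[idx], reduction="batchmean")
            loss.backward()
            optimizer.step()

            # Update Y^(t+1) by the soft-label rule: y_i <- s(theta, x_i).
            with torch.no_grad():
                labels[idx] = F.softmax(logits.detach(), dim=1)
    return model, labels
```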
7. Loss – Joint Optimization Framework
►Loss function
► L(θ, Y|X) = Lc(θ, Y|X) + α Lp(θ|X) + β Le(θ|X)
►Optimization
► arg min_{θ,Y} L(θ, Y|X)
►Parameters
► Y : label
► X : Image
► θ : parameters of network
► α : hyperparameter
► β : hyperparameter
8. Loss – Joint Optimization Framework
►First term
► Lc(θ, Y|X) = (1/n) Σ_{i=1}^{n} D_KL( y_i || s(θ, x_i) )
►Parameters
► Y : label
► X : image
► θ : parameters of network
► s : prediction of network
► n : train set size
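A minimal sketch of this term in PyTorch (logits and y are assumed mini-batch tensors):

```python
import torch.nn.functional as F

def lc_term(logits, y):
    """L_c = (1/n) * sum_i D_KL( y_i || s(theta, x_i) ).

    logits: (n, c) raw network outputs; y: (n, c) current labels.
    """
    log_s = F.log_softmax(logits, dim=1)  # log s(theta, x_i)
    # kl_div expects log-probabilities as input and probabilities as target;
    # "batchmean" sums over classes and averages over the n samples.
    return F.kl_div(log_s, y, reduction="batchmean")
```

When y_i is a fixed one-hot vector its entropy is zero, so minimizing this KL divergence is the same as minimizing the usual cross entropy on the next slide.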
9. Loss function – usual image
classification network
►Loss function
► L = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{c} y_{ij}^{GT} log s_j(θ, x_i)
►Optimization
► arg min_θ L(θ|X, Y)
►Parameters
► L : cross entropy between probability distribution y and s
► n : train set size
► c : number of classes
► Y : label (ground truth)
► s : prediction of network
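The equivalence between this cross entropy and the KL term with fixed one-hot labels can be checked numerically (a toy sketch with made-up logits):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)             # 4 samples, c = 3 classes (toy values)
targets = torch.tensor([0, 2, 1, 1])   # ground-truth class indices

# L = -(1/n) * sum_i sum_j y_ij^GT * log s_j(theta, x_i)
ce = F.cross_entropy(logits, targets)

# Same value via D_KL against one-hot labels (a one-hot label has zero entropy).
one_hot = F.one_hot(targets, num_classes=3).float()
kl = F.kl_div(F.log_softmax(logits, dim=1), one_hot, reduction="batchmean")
assert torch.allclose(ce, kl)
```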
10. Loss – Joint Optimization Framework
► Second term
► Lp = Σ_{j=1}^{c} p_j log( p_j / s̄_j(θ, X) )
► s̄(θ, X) = (1/n) Σ_{i=1}^{n} s(θ, x_i) ≈ (1/|B|) Σ_{x∈B} s(θ, x)
►Parameters
► p : prior probability distribution (distribution of
classes among all training data)
► X : image
► s : prediction of network
► θ : parameter of network
► c : number of classes
► n : train set size
► B : mini-batch (|B| : mini-batch size)
Ex:
In CIFAR-10, p will be [0.1, 0.1, 0.1, 0.1,
0.1, 0.1, 0.1, 0.1, 0.1, 0.1], because each
class has the same number of images in
CIFAR-10.
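A minimal sketch of the prior term under these definitions; the uniform prior below is the CIFAR-10 case, and logits are assumed mini-batch outputs:

```python
import torch
import torch.nn.functional as F

def lp_term(logits, p):
    """L_p = sum_j p_j * log( p_j / s_bar_j ), with s_bar approximated per batch.

    logits: (|B|, c) network outputs for one mini-batch; p: (c,) prior.
    """
    s_bar = F.softmax(logits, dim=1).mean(dim=0)  # mean prediction over the batch
    return torch.sum(p * torch.log(p / s_bar))

# CIFAR-10 prior: every class holds 1/10 of the training images.
p = torch.full((10,), 0.1)
```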
11. Loss – Joint Optimization Framework
►Third term
► Le = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{c} s_j(θ, x_i) log s_j(θ, x_i)
► Ex:
► Epoch t : s = [0.2,0.8]
► Epoch t+1 : s = [0.1,0.9]
►Parameters
► X : image
► s : prediction of network
► θ : parameter of network
► c : number of classes
► n : train set size
L(θ, Y|X) = Lc(θ, Y|X) + α Lp(θ|X) + β Le(θ|X)
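Putting the pieces together, a minimal sketch of the entropy term and of the full objective on one mini-batch; alpha, beta, and p are the quantities defined above:

```python
import torch
import torch.nn.functional as F

def le_term(logits):
    """L_e = -(1/n) * sum_i sum_j s_j(theta, x_i) * log s_j(theta, x_i)."""
    s = F.softmax(logits, dim=1)
    log_s = F.log_softmax(logits, dim=1)  # numerically safer than s.log()
    return -(s * log_s).sum(dim=1).mean()

def joint_loss(logits, y, p, alpha, beta):
    """L(theta, Y|X) = L_c + alpha * L_p + beta * L_e."""
    lc = F.kl_div(F.log_softmax(logits, dim=1), y, reduction="batchmean")
    lp = torch.sum(p * torch.log(p / F.softmax(logits, dim=1).mean(dim=0)))
    return lc + alpha * lp + beta * le_term(logits)
```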
12. Other strategy – large learning rate
►Experiment
► Test accuracy remains high at the end of training when the learning rate is high.
►Parameters
► X-axis : epoch
► Y-axis : test accuracy
► r : noise rate
► lr : learning rate
13. Experiment on SN-CIFAR10
best : the scores of the epoch where the validation
accuracy is optimal
last : the scores at the end of training
Test accuracy : Performance on test set
Recovery accuracy : Performance on the train set
y_i = y_i^GT with probability 1 − r,
      a random one-hot vector with probability r
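A minimal sketch of this symmetric-noise corruption (my reading of the definition; the function and variable names are made up):

```python
import numpy as np

def add_symmetric_noise(labels, num_classes, r, seed=0):
    """Replace each label, with probability r, by a uniformly random class."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(labels)) < r  # which samples get corrupted
    noisy[flip] = rng.integers(0, num_classes, size=int(flip.sum()))
    return noisy

# Toy example with CIFAR-10-style labels and noise rate r = 0.1.
clean = np.arange(10).repeat(5)
noisy = add_symmetric_noise(clean, num_classes=10, r=0.1)
```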
The paper I want to present is ‘Joint Optimization Framework for Learning with Noisy Labels’.
The first author is Daiki Tanaka.
This paper was published at CVPR 2018.
Deep neural networks have achieved significant performance on image classification.
However, many datasets are collected from websites.
Therefore, they tend to contain noisy labels.
These noisy labels decrease the performance of the network.
Hence, the authors propose a joint optimization framework for image classification.
This framework estimates true labels for the classification network.
Before starting, note that there are two kinds of labels for image classification.
For a hard label, each value in y is either 1 or 0, and their sum must be 1.
For a soft label, each value in y is between 0 and 1, and their sum must be 1.
X is the image, Y is the label, CNN is the convolutional neural network for image classification, L is the loss function,
and s is the probability prediction of the network, a probability vector over the c classes that can be regarded as a soft label.
This framework differs from the usual image classification framework in two respects:
the loss function and the labels.
The authors oppose treating the labels as fixed, because they are noisy.
Therefore, the labels are alternately updated at each epoch.
Let’s look at the algorithm first; I will explain the loss function later.
The algorithm is simple.
In each epoch, they update the parameters of the network with the optimizer,
then update the labels using the predictions of the network.
Lc is the KL divergence between the label and the prediction of the network.
When y is fixed, minimizing the KL divergence is the same as minimizing the cross entropy.
Therefore, this term is the same as the loss function of a usual image classification network.
In a usual image classification network, we just use the cross entropy between the label and the prediction of the network,
and try to find parameters θ that minimize the loss function.
The second term is the KL divergence between the prior probability distribution p and the mean prediction s̄.
s̄ is the mean prediction over the training data; in the implementation, it is approximated over each mini-batch.
However, this approximation cannot handle a very large number of classes or extremely imbalanced classes.
This term makes the predictions of the network follow the distribution p.
The final term is the entropy of the prediction of the network.
This term is required in the training loss when soft labels are used.
If α and β are zero and we update the labels with the soft-label rule,
both θ and the labels get stuck in a local optimum and the learning process does not proceed.
To overcome this problem, this term concentrates the probability distribution of each soft label on a single class.
In the experiments, test accuracy remains high at the end of training when the learning rate is high.
Symmetric-noise CIFAR-10 (SN-CIFAR10) is based on the CIFAR-10 dataset; each image's label is replaced with a random one with probability r.
Best is the score at the epoch where the validation accuracy is optimal.
Last is the score at the end of training.
Their method reaches the state of the art on CIFAR-10.
They also evaluate their method on AN-CIFAR10 and PL-CIFAR, where the performance is also good.
They use the Clothing1M dataset to examine the performance of their method in a real setting.
The images of this dataset are crawled from online shops, and the labels are generated from the surrounding texts of the images on the websites.
The accuracy of the noisy labels is 61.54%.
Their method achieves comparable performance on the Clothing1M dataset.