- The authors propose a method called PRODEN for partial-label learning that is model-, loss-, and optimizer-agnostic.
- PRODEN introduces a classifier-consistent risk estimator and dynamically updates label weights during training to guide the model towards the true labels.
- In experiments on benchmark and real-world datasets, PRODEN achieves performance comparable to oracle labels and outperforms other partial-label learning baselines, demonstrating its effectiveness.
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Progressive Identification of True Labels for Partial-Label Learning
Presenter: 송헌
Fundamentals Team: 김동희, 김지연, 김창연, 이근배, 이재윤
Lv, Jiaqi, et al. ICML. 2020.
Problem setting
In partial-label learning (PLL), each training instance is associated with
a set of candidate labels, among which exactly one is the true label.
The goal of PLL is to reduce the labeling overhead of
identifying the exact label among ambiguous candidates.
Related works
Most prior works are coupled to specific optimization algorithms,
which makes them difficult to apply to DNNs.
D2CNN* is the only prior work that trains DNNs with stochastic optimizers.
However, it restricts the networks to specific architectures.
Complementary-label learning** uses a class that an example does not belong to.
Hence, it can be viewed as an extreme PLL case with $c - 1$ candidate labels.
*Yao, et al. "Deep discriminative cnn with temporal ensembling for ambiguously-labeled image classification." AAAI. 2020.
**Ishida, Takashi, et al. "Complementary-label learning for arbitrary losses and models." ICML. 2019
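The correspondence between complementary labels and PLL is easy to make concrete: a complementary label simply says the example belongs to any of the other $c - 1$ classes. A minimal sketch (the helper name is my own, not from either paper):

```python
# Convert a complementary label into its equivalent PLL candidate set:
# the candidate set contains every class except the complementary one,
# i.e. exactly c - 1 candidate labels.
def complementary_to_candidates(comp_label, num_classes):
    return {y for y in range(num_classes) if y != comp_label}

candidates = complementary_to_candidates(comp_label=2, num_classes=5)
print(candidates)  # -> {0, 1, 3, 4}
```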
Contributions
In the paper,
the authors propose a classifier-consistent risk estimator for PLL and
theoretically show that the classifier learned from partially labeled data
converges to the optimal one learned from ordinarily labeled data.
The authors also propose a model-, loss-, and optimizer-agnostic method for PLL.
Ordinary Multi-class Classification
Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the instance space and $\mathcal{Y} = \{1, 2, \ldots, c\}$ be the label space.
Let $p(x, y)$ be the underlying joint density of random variables $(X, Y) \in \mathcal{X} \times \mathcal{Y}$.
The goal is to learn a classifier $\boldsymbol{g}: \mathcal{X} \to \mathbb{R}^c$ that minimizes the risk:
$$\mathcal{R}(\boldsymbol{g}) = \mathbb{E}_{(X, Y) \sim p(x, y)}\big[\ell(\boldsymbol{g}(X), \boldsymbol{e}_Y)\big]$$
where $\{\boldsymbol{e}_i : i \in \mathcal{Y}\}$ denotes the standard canonical (one-hot) vectors.
Partial-Label Learning
Each instance is assigned a candidate label set $S \in \mathcal{P}(\mathcal{Y})$, drawn from the power set of $\mathcal{Y}$, that contains the true label $Y$.
Therefore, we need to train a classifier from partially labeled examples $(X, S)$.
The PLL risk estimator is defined over $p(x, s)$:
$$\mathcal{R}_{\mathrm{PLL}}(\boldsymbol{g}) = \mathbb{E}_{(X, S) \sim p(x, s)}\big[\ell_{\mathrm{PLL}}(\boldsymbol{g}(X), S)\big]$$
where $\ell_{\mathrm{PLL}}: \mathbb{R}^c \times \mathcal{P}(\mathcal{Y}) \to \mathbb{R}$.
Classifier-Consistent Risk Estimator
To make $\mathcal{R}_{\mathrm{PLL}}(\boldsymbol{g})$ estimable, an intuitive way is to use a surrogate loss.
The authors assume that only the true label contributes to retrieving the classifier,
and accordingly define the PLL loss as the minimal loss over the candidate label set:
$$\ell_{\mathrm{PLL}}(\boldsymbol{g}(X), S) = \min_{i \in S} \ell(\boldsymbol{g}(X), \boldsymbol{e}_i)$$
This leads to a new risk estimator:
$$\mathcal{R}_{\mathrm{PLL}}(\boldsymbol{g}) = \mathbb{E}_{(X, S) \sim p(x, s)}\Big[\min_{i \in S} \ell(\boldsymbol{g}(X), \boldsymbol{e}_i)\Big]$$
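This minimal-loss surrogate can be sketched in a few lines of plain Python, using softmax cross-entropy as the base loss $\ell$ (the function names are my own, not from the authors' code):

```python
import math

def cross_entropy(logits, label):
    # Numerically stable softmax cross-entropy against the one-hot vector e_label.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def pll_min_loss(logits, candidates):
    # l_PLL(g(x), S) = min_{i in S} l(g(x), e_i): only the "easiest"
    # candidate label contributes to the loss.
    return min(cross_entropy(logits, i) for i in candidates)

logits = [2.0, 0.5, -1.0]
# Class 0 has the largest logit, so among candidates {0, 1} it attains the minimum.
print(pll_min_loss(logits, {0, 1}))
```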
Lemmas
The ambiguity degree is defined as
$$\gamma = \sup_{(x, y) \sim p(x, y),\; s \sim p(s \mid x, y),\; \bar{y} \in \mathcal{Y},\; \bar{y} \neq y} \Pr(\bar{y} \in S)$$
$\gamma$ is the maximum probability that a negative label $\bar{y}$ co-occurs with the true label $Y$ in the candidate set.
The small ambiguity degree condition ($\gamma < 1$) implies that, except for the true label,
no other label is included in the candidate label set with probability 1.
Moreover, if $\ell$ is the cross-entropy or MSE loss, the ordinary optimal classifier $\boldsymbol{g}^*$
satisfies $g_i^*(X) = p(Y = i \mid X)$.
Connection
Under the deterministic scenario,
if the small ambiguity degree condition is satisfied,
and CE or MSE loss is used, then,
the PLL optimal classifier $\boldsymbol{g}^*_{\mathrm{PLL}}$ of $\mathcal{R}_{\mathrm{PLL}}(\boldsymbol{g})$ is equivalent to
the ordinary optimal classifier $\boldsymbol{g}^*$ of $\mathcal{R}(\boldsymbol{g})$:
$$\boldsymbol{g}^*_{\mathrm{PLL}} = \boldsymbol{g}^*$$
Estimation Error Bound
Let $\widehat{\mathcal{R}}_{\mathrm{PLL}}$ be the empirical counterpart of $\mathcal{R}_{\mathrm{PLL}}$, and $\hat{\boldsymbol{g}}_{\mathrm{PLL}} = \arg\min \widehat{\mathcal{R}}_{\mathrm{PLL}}(\boldsymbol{g})$ be the
empirical risk minimizer. Suppose $\mathcal{G}_y$ is a class of real functions, and let
$\Re_n(\mathcal{G}_y)$ denote the Rademacher complexity of $\mathcal{G}_y$ over $p(x)$ with sample size $n$.
Then, for any $\delta > 0$, we have with probability at least $1 - \delta$,
$$\mathcal{R}_{\mathrm{PLL}}(\hat{\boldsymbol{g}}_{\mathrm{PLL}}) - \mathcal{R}_{\mathrm{PLL}}(\boldsymbol{g}^*_{\mathrm{PLL}}) \le 4\sqrt{2}\, c\, L_\ell \sum_{y=1}^{c} \Re_n(\mathcal{G}_y) + 2M \sqrt{\frac{\log \frac{2}{\delta}}{2n}}$$
Therefore, $\mathcal{R}_{\mathrm{PLL}}(\hat{\boldsymbol{g}}_{\mathrm{PLL}}) \to \mathcal{R}_{\mathrm{PLL}}(\boldsymbol{g}^*_{\mathrm{PLL}})$ as the number of training data $n \to \infty$.
Proposed Method
However, the min operator in $\ell_{\mathrm{PLL}}(\boldsymbol{g}(X), S)$ makes optimization difficult:
if a wrong label $i$ is selected at the beginning,
the optimization will keep focusing on that wrong label until the end.
The authors first require that $\ell$ can be decomposed over the labels:
$$\ell(\boldsymbol{g}(X), \boldsymbol{e}_Y) = \sum_{i=1}^{c} \ell\big(g_i(X), e_i^Y\big)$$
Then, they relax the min operator with dynamic weights:
$$\widehat{\mathcal{R}}_{\mathrm{PLL}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} w_{ij}\, \ell\big(g_j(x_i), e_j^{s_i}\big)$$
where $e_j^{s_i}$ is the $j$-th coordinate of $\boldsymbol{e}_{s_i}$ and $\boldsymbol{e}_{s_i} = \sum_{k \in s_i} \boldsymbol{e}_k$.
Proposed Method
Ideally, the weight is exactly 1 for the true label and 0 otherwise.
Since the weights are latent, the minimizer of $\widehat{\mathcal{R}}_{\mathrm{PLL}}$ cannot be solved directly.
Inspired by the EM algorithm, the authors put more weight on the more probable labels:
$$w_{ij} = \begin{cases} \dfrac{g_j(x_i)}{\sum_{k \in s_i} g_k(x_i)}, & j \in s_i \\ 0, & \text{otherwise} \end{cases}$$
If the small ambiguity degree condition is satisfied, models tend to memorize the
true labels in the initial epochs, which guides the model towards a discriminative
classifier that gives relatively low losses to the more probable true labels.
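The weight update and the weighted, label-decomposed loss can be sketched in plain Python, with softmax outputs standing in for $\boldsymbol{g}(x)$ (helper names are my own, a sketch rather than the authors' implementation):

```python
import math

def softmax(logits):
    # Model outputs g(x) as class probabilities.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def update_weights(probs, candidates):
    # w_ij = g_j(x_i) / sum_{k in s_i} g_k(x_i) for j in s_i, 0 otherwise:
    # renormalize the model's current outputs over the candidate set.
    total = sum(probs[k] for k in candidates)
    return [probs[j] / total if j in candidates else 0.0
            for j in range(len(probs))]

def weighted_pll_loss(probs, weights):
    # Weighted decomposed loss: sum_j w_ij * l(g_j(x_i), e_j^{s_i}),
    # here with the per-label cross-entropy l = -log g_j(x_i).
    return sum(w * -math.log(p) for w, p in zip(weights, probs) if w > 0)

probs = softmax([2.0, 0.5, -1.0])
weights = update_weights(probs, candidates={0, 1})
print([round(w, 3) for w in weights])  # most weight moves to the confident candidate
```

In training, `update_weights` would be re-applied with the model's latest outputs at every update, so the weights progressively concentrate on the label the model finds easiest to fit.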
Proposed Method
While the method follows the spirit of the EM algorithm,
it merges the E-step and M-step.
The weights can be updated at any epoch,
so local convergence within each epoch is not necessary.
This frees the method from the overfitting
issues of EM-style methods.
Datasets
The authors used widely used benchmark datasets,
MNIST, Fashion-MNIST, Kuzushiji-MNIST, and CIFAR-10,
and five small datasets from UCI:
Yeast, Texture, Dermatology, Synthetic Control, and 20Newsgroups.
Candidate sets were generated by randomly flipping each negative label into a candidate label with probability $q$.
Moreover, they used real-world partial-label datasets:
Lost, Birdsong, MSRCv2, Soccer Player, and Yahoo! News.
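The candidate-generation scheme used for the benchmark datasets above can be sketched as follows (`make_candidate_set` is a hypothetical helper; $q$ is the flip probability):

```python
import random

def make_candidate_set(true_label, num_classes, q, rng=random):
    # The true label is always a candidate; every negative label is
    # independently flipped into the candidate set with probability q.
    return {true_label} | {y for y in range(num_classes)
                           if y != true_label and rng.random() < q}

rng = random.Random(0)
s = make_candidate_set(true_label=3, num_classes=10, q=0.1, rng=rng)
print(s)  # always contains 3; each other label appears with probability q
```

With $q = 0$ this degenerates to ordinary supervised labels, and with $q = 1$ every candidate set is the whole label space, matching the role of $q$ as an ambiguity knob.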
Baselines
They compared the proposed method (PRODEN) with:
• PRODEN-itera: updates the label weights only every 100 epochs
• PRODEN-sudden: sets $w_{ij} = 1$ if $j = \arg\max_{k \in s_i} g_k(x_i)$ and 0 otherwise
• PRODEN-naïve: never updates the weights, keeping them uniform
• PN-oracle: trains a model with the ordinary (true) labels
• PN-decomp: decomposes one instance with multiple candidate labels
into many instances, each with a single label
• D2CNN: a PLL method based on DNNs
• GA: a complementary-label learning (CLL) method based on DNNs
Results on Benchmark Datasets
When 𝑞 = 0.1, PRODEN is always the best method and comparable to PN-oracle.
The performance of PRODEN-itera deteriorates drastically with complex models
because of overfitting.
Results on Benchmark Datasets
When 𝑞 = 0.7, PRODEN is still comparable to PN-oracle.
PRODEN consistently outperforms D2CNN and GA.
Analysis on the Ambiguity Degree
They also gradually increase $q$ from 0.5 to 0.9 to simulate the ambiguity degree $\gamma$ ($\gamma \to q$ as $n \to \infty$).
PRODEN tends to be the least affected by increased ambiguity.
Results on Real-world Datasets
They compare the proposed method with classical PLL methods
(SURE, CLPL, ECOC, PLSVM, PLkNN, and IPAL),
which can hardly be implemented with DNNs, on real-world and small-scale datasets.