5. Shortcut learning: architectural design issue
[Figure] A biased network: any available channel transmitting the information of $Z_{sp}$ can be exploited, so networks would exploit the bias attribute $Z_{sp}$ rather than the invariant attribute $Z_{inv}$.
6. Idea: debiased neural pruning
[Figure] A debiased network: pruning weights on $Z_{sp}$ reduces the effective dimension of spurious features and improves generalization, shifting reliance from the bias attribute $Z_{sp}$ to the invariant attribute $Z_{inv}$.
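To make the idea concrete, here is a minimal sketch (not the authors' implementation): a linear classifier head reads concatenated features, and the weights on the spurious block are zeroed out. The feature layout and the names d_inv, d_sp are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical feature layout: the first d_inv dimensions carry Z_inv,
# the remaining d_sp dimensions carry the spurious attribute Z_sp.
d_inv, d_sp, n_classes = 16, 48, 2
head = nn.Linear(d_inv + d_sp, n_classes)

# Debiased pruning idea: zero out every weight that reads the spurious block,
# reducing the effective dimension of spurious features seen by the classifier.
mask = torch.ones_like(head.weight)
mask[:, d_inv:] = 0.0                 # prune weights on Z_sp
with torch.no_grad():
    head.weight.mul_(mask)

z = torch.randn(4, d_inv + d_sp)      # a batch of features [Z_inv ; Z_sp]
logits = head(z)                      # predictions now depend only on Z_inv (and the bias term)
print(logits.shape)                   # torch.Size([4, 2])
```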
8. How to discover the debiased subnetworks?
Observation 1. Potential limitations of existing algorithms
[Figure] Training bound vs. test bound.
Observation 2. Importance of bias-conflicting samples
[Figure] The new training bound converges to the test bound as $\phi \to 1 - \frac{1}{2p^{\eta}}$.
10. Motivating example – Stage 2. pruning
[Figure] A linear model in environment $e$: an invariant feature $Z^e_{inv}$ and spurious features $Z^e_{sp,1}, Z^e_{sp,2}, \dots, Z^e_{sp,D}$ predict $Y^e$ through weights $W_{inv}(t)$ and $W_{sp,i}(t)$. Pruning parameters $\pi_{inv}, \pi_{sp,1}, \pi_{sp,2}, \dots, \pi_{sp,D}$ give the probability of preserving each weight. Sampled binary masks (e.g. $m_{inv}=1$, $m_{sp,1}=0$, $m_{sp,2}=1$, $m_{sp,D}=0$) are applied elementwise ($\odot$) to the weights, and a loss function of $\pi$ is minimized; a minimal sketch follows below.
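The sketch below is a toy rendering of this stage, not the paper's training code. It assumes a Gumbel-Sigmoid (concrete) relaxation so that the preserve-probabilities $\pi$ receive gradients from the sampled masks; the variable names and the synthetic data are hypothetical.

```python
import torch
import torch.nn.functional as F

def sample_mask(pi_logits, tau=0.5):
    """Sample a relaxed Bernoulli mask m ~ Bern(sigmoid(pi_logits)).

    Gumbel-Sigmoid (concrete) relaxation, so the pruning parameters
    receive gradients through the sampled mask.
    """
    u = torch.rand_like(pi_logits).clamp(1e-6, 1 - 1e-6)
    g = torch.log(u) - torch.log1p(-u)           # logistic noise
    return torch.sigmoid((pi_logits + g) / tau)  # soft mask in (0, 1)

# Toy setup: one invariant weight and D spurious weights, frozen after stage 1.
D = 8
w = torch.randn(D + 1)
w[0] = w[0].abs() + 1.0                  # make the invariant weight useful in this toy
pi_logits = torch.zeros(D + 1, requires_grad=True)  # preserve-probabilities pi (as logits)
opt = torch.optim.Adam([pi_logits], lr=1e-1)

z = torch.randn(256, D + 1)              # features [Z_inv ; Z_sp,1 .. Z_sp,D]
y = (z[:, 0] > 0).float()                # labels driven only by the invariant feature

for _ in range(200):
    m = sample_mask(pi_logits)           # example of a sampled (relaxed) mask
    logits = z @ (m * w)                 # masked prediction, elementwise m ⊙ w
    loss = F.binary_cross_entropy_with_logits(logits, y)  # loss function of pi
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.sigmoid(pi_logits))          # ideally high for pi_inv, low for pi_sp,i
```

In this toy, the spurious coordinates only add noise to the prediction, so minimizing the loss tends to push their preserve-probabilities toward zero while keeping the invariant one high.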
11. Observation 1: difficulty of learning pruning parameters
Theorem 1. (Generalization gap)
• Assume that $p^e > 1/2$ for a given training environment $e$ (biased setting). Then the upper bound of $\ell^e(\pi)$ is given as
$$\ell^e(\pi) \le 2\exp\!\left(-\frac{2\big(\pi_{inv} + (2p^e - 1)\,\alpha_i(t)\,\pi_{sp,i}\big)^2}{4\alpha_i(t)^2 + 1}\right),$$
where $\alpha_i(t) > 0$.
• However, given a test environment $e$ with $p^e = 1/2$,
$$\ell^e(\pi) \le 2\exp\!\left(-\frac{2\pi_{inv}^2}{4\alpha_i(t)^2 + 1}\right).$$
TL;DR: reliance on $z_{sp}$ leads to a mismatch of the bounds (failure of standard pruning algorithms).
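To see the mismatch numerically, the snippet below plugs illustrative values (chosen arbitrarily, not taken from the paper) into the two bounds of Theorem 1.

```python
import math

def bound(pi_inv, pi_sp, alpha, p):
    """Theorem 1 upper bound on l^e(pi) in an environment with bias level p."""
    num = 2.0 * (pi_inv + (2.0 * p - 1.0) * alpha * pi_sp) ** 2
    return 2.0 * math.exp(-num / (4.0 * alpha ** 2 + 1.0))

pi_inv, pi_sp, alpha = 0.3, 1.0, 2.0            # illustrative values only
train = bound(pi_inv, pi_sp, alpha, p=0.95)     # biased training environment, p^e > 1/2
test = bound(pi_inv, pi_sp, alpha, p=0.50)      # unbiased test environment, p^e = 1/2
print(f"train bound: {train:.3f}, test bound: {test:.3f}")
# Keeping spurious weights (pi_sp near 1) drives the training bound down, but the
# test bound depends only on pi_inv, so minimizing the training loss alone gives
# no incentive to prune the spurious weights.
```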
12. Observation 2: importance of bias-conflicting samples
$$P^{\eta}_{mix}(Z_{sp,i} \mid Y=y) = \phi\, P^{\eta}_{debias}(Z_{sp,i} \mid Y=y) + (1-\phi)\, P^{\eta}_{bias}(Z_{sp,i} \mid Y=y)$$
• Thm 1: a lack of bias-conflicting samples leads pruning to preserve spurious weights.
• This motivates us to analyze the behavior in another environment $\eta$ where we can systematically augment bias-conflicting samples.
13. Observation 2: importance of bias-conflicting samples
Theorem 2. (Training bound with the mixture distribution)
• Assume that $P^{\eta}_{mix}$ is biased. Then $0 \le \phi \le 1 - \frac{1}{2p^{\eta}}$ and
$$\ell^{\eta}(\pi) \le 2\exp\!\left(-\frac{2\big(\pi_{inv} + (2p^{\eta}(1-\phi) - 1)\,\alpha_i(t)\,\pi_{sp,i}\big)^2}{4\alpha_i(t)^2 + 1}\right).$$
• Furthermore, when $\phi = 1 - \frac{1}{2p^{\eta}}$, the mixture distribution is debiased and
$$\ell^{\eta}(\pi) \le 2\exp\!\left(-\frac{2\pi_{inv}^2}{4\alpha_i(t)^2 + 1}\right).$$
TL;DR: the generalization gap is closed by sampling from the true debiasing distribution $P^{\eta}_{debias}$.
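As a quick numerical sanity check (illustrative values only, not results from the paper), the snippet below evaluates the Theorem 2 bound for several mixture ratios and verifies that $\phi = 1 - \frac{1}{2p^{\eta}}$ recovers the test bound.

```python
import math

def mix_bound(pi_inv, pi_sp, alpha, p_eta, phi):
    """Theorem 2 training bound under the mixture distribution with ratio phi."""
    num = 2.0 * (pi_inv + (2.0 * p_eta * (1.0 - phi) - 1.0) * alpha * pi_sp) ** 2
    return 2.0 * math.exp(-num / (4.0 * alpha ** 2 + 1.0))

pi_inv, pi_sp, alpha, p_eta = 0.3, 1.0, 2.0, 0.95       # illustrative values only
test_bound = 2.0 * math.exp(-2.0 * pi_inv ** 2 / (4.0 * alpha ** 2 + 1.0))

phi_star = 1.0 - 1.0 / (2.0 * p_eta)                    # debiasing mixture ratio
for phi in (0.0, 0.25, phi_star):
    print(f"phi = {phi:.3f}: training bound {mix_bound(pi_inv, pi_sp, alpha, p_eta, phi):.3f}")
print(f"test bound: {test_bound:.3f}")
# At phi = 1 - 1/(2 p_eta) the spurious term vanishes, the training bound coincides
# with the test bound, and the generalization gap is closed.
```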
14. Important clues
$$P^{\eta}_{mix}(Z_{sp,i} \mid Y=y) = \phi\, P^{\eta}_{debias}(Z_{sp,i} \mid Y=y) + (1-\phi)\, P^{\eta}_{bias}(Z_{sp,i} \mid Y=y)$$
We have to:
• Approximate the unknown $P^{\eta}_{debias}$ with existing samples
• Modify the sampling strategy to simulate $P^{\eta}_{mix}$ (see the sketch below)
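One plausible way to realize both points, assuming bias-conflicting samples have already been flagged (e.g. by a bias-capturing model), is to up-weight them in the data loader so that the empirical sampling distribution mimics $P^{\eta}_{mix}$. The flag is_conflicting and the toy tensors below are hypothetical, not part of the paper's code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

def mixture_sampler(is_conflicting, phi, num_samples):
    """Sampler whose draws approximate the mixture P_mix^eta.

    is_conflicting: bool tensor, True for samples assumed to be bias-conflicting,
    which stand in for draws from the unknown P_debias^eta.
    phi: target mixture ratio, e.g. close to 1 - 1/(2 * p_eta).
    """
    n_conf = int(is_conflicting.sum())
    n_align = len(is_conflicting) - n_conf
    weights = torch.where(
        is_conflicting,
        torch.full_like(is_conflicting, phi / max(n_conf, 1), dtype=torch.float),
        torch.full_like(is_conflicting, (1.0 - phi) / max(n_align, 1), dtype=torch.float),
    )
    return WeightedRandomSampler(weights, num_samples=num_samples, replacement=True)

# Toy usage with hypothetical tensors and conflict flags.
x, y = torch.randn(1000, 8), torch.randint(0, 2, (1000,))
is_conflicting = torch.rand(1000) < 0.05      # ~5% minority (bias-conflicting) samples
loader = DataLoader(
    TensorDataset(x, y),
    batch_size=64,
    sampler=mixture_sampler(is_conflicting, phi=0.47, num_samples=1000),
)
```

With these weights, roughly a fraction $\phi$ of the drawn samples comes from the flagged minority, so the loader approximately simulates the mixture analyzed in Theorem 2.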
18. Results: ablation study
• Pruning contributes significantly: (1→2) +7.19%, (3→5) +11.59%, and (4→6) +8.68%.
• The proposed method does not require weight reset [ref].
[ref]: Frankle, J., & Carbin, M. (2018). The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
19. Results: sparsity & sensitivity analysis
• Accuracy increases as more (potentially biased) weights are pruned out.
• A trade-off between performance and sparsity does exist.
• The proposed framework is reasonably tolerant to high sparsity.
20. Results: dependency on bias-capturing models
DCWP may perform reasonably well even with a limited number and quality of bias-conflicting samples.
22. Summary
• We presented a novel functional subnetwork probing method for OOD generalization.
• We provided theoretical insights and empirical evidence to show that the minority
samples provide an important clue for probing the optimal unbiased subnetworks.
• The proposed method is memory efficient and potentially compatible with many other
debiasing methods.