5. Shortcut learning: architectural design issue
[Figure] A biased network: any available channel transmitting the information of $Z_{sp}$ can be exploited, so networks would exploit the bias attribute $Z_{sp}$ rather than the invariant attribute $Z_{inv}$.
6. Idea: debiased neural pruning
[Figure] A debiased network: pruning weights on $Z_{sp}$ reduces the effective dimension of spurious features and improves generalization, shifting reliance from the bias attribute $Z_{sp}$ to the invariant attribute $Z_{inv}$.
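To make the idea concrete, here is a minimal sketch (not the authors' implementation): a linear classifier head reads concatenated features, and the weights on the spurious block are zeroed out. The feature layout and the names d_inv, d_sp are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical feature layout: the first d_inv dimensions carry Z_inv,
# the remaining d_sp dimensions carry the spurious attribute Z_sp.
d_inv, d_sp, n_classes = 16, 48, 2
head = nn.Linear(d_inv + d_sp, n_classes)

# Debiased pruning idea: zero out every weight that reads the spurious block,
# reducing the effective dimension of spurious features seen by the classifier.
mask = torch.ones_like(head.weight)
mask[:, d_inv:] = 0.0                 # prune weights on Z_sp
with torch.no_grad():
    head.weight.mul_(mask)

z = torch.randn(4, d_inv + d_sp)      # a batch of features [Z_inv ; Z_sp]
logits = head(z)                      # predictions now depend only on Z_inv (and the bias term)
print(logits.shape)                   # torch.Size([4, 2])
```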
8. How to discover the debiased subnetworks?
Observation 1. Potential limitations of existing algorithms
[Figure] Training bound vs. test bound.
Observation 2. Importance of bias-conflicting samples
[Figure] The new training bound converges to the test bound as $\phi \to 1 - \frac{1}{2p^{\eta}}$.
10. Motivating example – Stage 2. pruning
[Figure] A linear model in environment $e$: an invariant feature $Z^e_{inv}$ and spurious features $Z^e_{sp,1}, Z^e_{sp,2}, \dots, Z^e_{sp,D}$ predict $Y^e$ through weights $W_{inv}(t)$ and $W_{sp,i}(t)$. Pruning parameters $\pi_{inv}, \pi_{sp,1}, \pi_{sp,2}, \dots, \pi_{sp,D}$ give the probability of preserving each weight. Sampled binary masks (e.g. $m_{inv}=1$, $m_{sp,1}=0$, $m_{sp,2}=1$, $m_{sp,D}=0$) are applied elementwise ($\odot$) to the weights, and a loss function of $\pi$ is minimized; a minimal sketch follows below.
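The sketch below is a toy rendering of this stage, not the paper's training code. It assumes a Gumbel-Sigmoid (concrete) relaxation so that the preserve-probabilities $\pi$ receive gradients from the sampled masks; the variable names and the synthetic data are hypothetical.

```python
import torch
import torch.nn.functional as F

def sample_mask(pi_logits, tau=0.5):
    """Sample a relaxed Bernoulli mask m ~ Bern(sigmoid(pi_logits)).

    Gumbel-Sigmoid (concrete) relaxation, so the pruning parameters
    receive gradients through the sampled mask.
    """
    u = torch.rand_like(pi_logits).clamp(1e-6, 1 - 1e-6)
    g = torch.log(u) - torch.log1p(-u)           # logistic noise
    return torch.sigmoid((pi_logits + g) / tau)  # soft mask in (0, 1)

# Toy setup: one invariant weight and D spurious weights, frozen after stage 1.
D = 8
w = torch.randn(D + 1)
w[0] = w[0].abs() + 1.0                  # make the invariant weight useful in this toy
pi_logits = torch.zeros(D + 1, requires_grad=True)  # preserve-probabilities pi (as logits)
opt = torch.optim.Adam([pi_logits], lr=1e-1)

z = torch.randn(256, D + 1)              # features [Z_inv ; Z_sp,1 .. Z_sp,D]
y = (z[:, 0] > 0).float()                # labels driven only by the invariant feature

for _ in range(200):
    m = sample_mask(pi_logits)           # example of a sampled (relaxed) mask
    logits = z @ (m * w)                 # masked prediction, elementwise m ⊙ w
    loss = F.binary_cross_entropy_with_logits(logits, y)  # loss function of pi
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.sigmoid(pi_logits))          # ideally high for pi_inv, low for pi_sp,i
```

In this toy, the spurious coordinates only add noise to the prediction, so minimizing the loss tends to push their preserve-probabilities toward zero while keeping the invariant one high.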
11. Observation 1: difficulty of learning pruning parameters
Theorem 1. (Generalization gap)
• Assume that $p^e > 1/2$ for a given training environment $e$ (biased setting). Then the upper bound of $\ell^e(\pi)$ is given as
$$\ell^e(\pi) \le 2\exp\!\left(-\frac{2\big(\pi_{inv} + (2p^e - 1)\,\alpha_i(t)\,\pi_{sp,i}\big)^2}{4\alpha_i(t)^2 + 1}\right),$$
where $\alpha_i(t) > 0$.
• However, given a test environment $e$ with $p^e = 1/2$,
$$\ell^e(\pi) \le 2\exp\!\left(-\frac{2\pi_{inv}^2}{4\alpha_i(t)^2 + 1}\right).$$
TL;DR: reliance on $z_{sp}$ leads to a mismatch of the bounds (failure of standard pruning algorithms).
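To see the mismatch numerically, the snippet below plugs illustrative values (chosen arbitrarily, not taken from the paper) into the two bounds of Theorem 1.

```python
import math

def bound(pi_inv, pi_sp, alpha, p):
    """Theorem 1 upper bound on l^e(pi) in an environment with bias level p."""
    num = 2.0 * (pi_inv + (2.0 * p - 1.0) * alpha * pi_sp) ** 2
    return 2.0 * math.exp(-num / (4.0 * alpha ** 2 + 1.0))

pi_inv, pi_sp, alpha = 0.3, 1.0, 2.0            # illustrative values only
train = bound(pi_inv, pi_sp, alpha, p=0.95)     # biased training environment, p^e > 1/2
test = bound(pi_inv, pi_sp, alpha, p=0.50)      # unbiased test environment, p^e = 1/2
print(f"train bound: {train:.3f}, test bound: {test:.3f}")
# Keeping spurious weights (pi_sp near 1) drives the training bound down, but the
# test bound depends only on pi_inv, so minimizing the training loss alone gives
# no incentive to prune the spurious weights.
```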
12. Observation 2: importance of bias-conflicting samples
$$P^{\eta}_{mix}(Z_{sp,i} \mid Y=y) = \phi\, P^{\eta}_{debias}(Z_{sp,i} \mid Y=y) + (1-\phi)\, P^{\eta}_{bias}(Z_{sp,i} \mid Y=y)$$
• Thm 1: a lack of bias-conflicting samples leads pruning to preserve spurious weights.
• This motivates us to analyze the behavior in another environment $\eta$ where we can systematically augment bias-conflicting samples.
13. Observation 2: importance of bias-conflicting samples
Theorem 2. (Training bound with the mixture distribution)
• Assume that $P^{\eta}_{mix}$ is biased. Then $0 \le \phi \le 1 - \frac{1}{2p^{\eta}}$ and
$$\ell^{\eta}(\pi) \le 2\exp\!\left(-\frac{2\big(\pi_{inv} + (2p^{\eta}(1-\phi) - 1)\,\alpha_i(t)\,\pi_{sp,i}\big)^2}{4\alpha_i(t)^2 + 1}\right).$$
• Furthermore, when $\phi = 1 - \frac{1}{2p^{\eta}}$, the mixture distribution is debiased and
$$\ell^{\eta}(\pi) \le 2\exp\!\left(-\frac{2\pi_{inv}^2}{4\alpha_i(t)^2 + 1}\right).$$
TL;DR: the generalization gap is closed by sampling from the true debiasing distribution $P^{\eta}_{debias}$.
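As a quick numerical sanity check (illustrative values only, not results from the paper), the snippet below evaluates the Theorem 2 bound for several mixture ratios and verifies that $\phi = 1 - \frac{1}{2p^{\eta}}$ recovers the test bound.

```python
import math

def mix_bound(pi_inv, pi_sp, alpha, p_eta, phi):
    """Theorem 2 training bound under the mixture distribution with ratio phi."""
    num = 2.0 * (pi_inv + (2.0 * p_eta * (1.0 - phi) - 1.0) * alpha * pi_sp) ** 2
    return 2.0 * math.exp(-num / (4.0 * alpha ** 2 + 1.0))

pi_inv, pi_sp, alpha, p_eta = 0.3, 1.0, 2.0, 0.95       # illustrative values only
test_bound = 2.0 * math.exp(-2.0 * pi_inv ** 2 / (4.0 * alpha ** 2 + 1.0))

phi_star = 1.0 - 1.0 / (2.0 * p_eta)                    # debiasing mixture ratio
for phi in (0.0, 0.25, phi_star):
    print(f"phi = {phi:.3f}: training bound {mix_bound(pi_inv, pi_sp, alpha, p_eta, phi):.3f}")
print(f"test bound: {test_bound:.3f}")
# At phi = 1 - 1/(2 p_eta) the spurious term vanishes, the training bound coincides
# with the test bound, and the generalization gap is closed.
```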
14. Important clues
$$P^{\eta}_{mix}(Z_{sp,i} \mid Y=y) = \phi\, P^{\eta}_{debias}(Z_{sp,i} \mid Y=y) + (1-\phi)\, P^{\eta}_{bias}(Z_{sp,i} \mid Y=y)$$
We have to:
• Approximate the unknown $P^{\eta}_{debias}$ with existing samples
• Modify the sampling strategy to simulate $P^{\eta}_{mix}$ (see the sketch below)
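One plausible way to realize both points, assuming bias-conflicting samples have already been flagged (e.g. by a bias-capturing model), is to up-weight them in the data loader so that the empirical sampling distribution mimics $P^{\eta}_{mix}$. The flag is_conflicting and the toy tensors below are hypothetical, not part of the paper's code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

def mixture_sampler(is_conflicting, phi, num_samples):
    """Sampler whose draws approximate the mixture P_mix^eta.

    is_conflicting: bool tensor, True for samples assumed to be bias-conflicting,
    which stand in for draws from the unknown P_debias^eta.
    phi: target mixture ratio, e.g. close to 1 - 1/(2 * p_eta).
    """
    n_conf = int(is_conflicting.sum())
    n_align = len(is_conflicting) - n_conf
    weights = torch.where(
        is_conflicting,
        torch.full_like(is_conflicting, phi / max(n_conf, 1), dtype=torch.float),
        torch.full_like(is_conflicting, (1.0 - phi) / max(n_align, 1), dtype=torch.float),
    )
    return WeightedRandomSampler(weights, num_samples=num_samples, replacement=True)

# Toy usage with hypothetical tensors and conflict flags.
x, y = torch.randn(1000, 8), torch.randint(0, 2, (1000,))
is_conflicting = torch.rand(1000) < 0.05      # ~5% minority (bias-conflicting) samples
loader = DataLoader(
    TensorDataset(x, y),
    batch_size=64,
    sampler=mixture_sampler(is_conflicting, phi=0.47, num_samples=1000),
)
```

With these weights, roughly a fraction $\phi$ of the drawn samples comes from the flagged minority, so the loader approximately simulates the mixture analyzed in Theorem 2.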
18. Results: ablation study
• Pruning contributes significantly: (1→2) +7.19%, (3→5) +11.59%, and (4→6) +8.68%.
• The proposed method does not require weight reset [ref].
[ref]: Frankle, J., & Carbin, M. (2018). The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
19. Results: sparsity & sensitivity analysis
• Accuracy increases as more (potentially biased) weights are pruned out.
• A trade-off between performance and sparsity does exist.
• The proposed framework is reasonably tolerant to high sparsity.
20. Results: dependency on bias-capturing models
DCWP may perform reasonably well even with a limited number and quality of bias-conflicting samples.
22. Summary
• We presented a novel functional subnetwork probing method for OOD generalization.
• We provided theoretical insights and empirical evidence to show that the minority
samples provide an important clue for probing the optimal unbiased subnetworks.
• The proposed method is memory efficient and potentially compatible with many other
debiasing methods.